
The Hundred-Page Machine Learning Book











Copyright © 2019 Andriy Burkov



All rights reserved. This book is distributed on the “read first, buy later” principle. The latter implies that anyone can obtain a copy of the book by any means available, read it and share it with anyone else. However, if you read and liked the book, or found it helpful or useful in any way, you have to buy it. For further information, please email author@themlbook.com.




To my parents:

Tatiana and Valeriy

and to my family:

daughters Catherine and Eva,

and brother Dmitriy




“All models are wrong, but some are useful.”

George Box



“If I had more time, I would have written a shorter letter.”

Blaise Pascal




This book is distributed on the “read first, buy later” principle.


Foreword


The last twenty years have witnessed an explosion in the availability of enormous quantities of data and, correspondingly, of interest in statistical and machine learning applications. The impact has been profound. Ten years ago, when I was able to attract a full class of MBA students to my new statistical learning elective, my colleagues were astonished because our department struggled to fill most electives. Today we offer a Master’s in Business Analytics, which is the largest specialized master’s program in the school and has application volume rivaling those of our MBA programs. Our course offerings have increased dramatically, yet our students still complain that the classes are all full. Our experience is not unique, with data science and machine learning programs springing up at an extraordinary rate as the demand for individuals trained in this area has blossomed.


This demand is driven by a simple, but undeniable, fact. Machine learning approaches have produced significant new insights in numerous settings such as the social sciences, business, biology and medicine, to name just a few. As a result, there is a tremendous demand for individuals with the requisite skill set. However, training students in these skills has been challenging because most of the early literature on these methods was aimed at academics and concentrated on statistical and theoretical properties of the fitting algorithms or resulting estimators. There was little support for researchers and practitioners who needed help in implementing a given method on real-world problems. These individuals needed to understand the range of methods that can be applied to each problem, along with their assumptions, strengths and weaknesses. But theoretical properties or detailed information on the fitting algorithms were far less important. Our goal when we wrote “An Introduction to Statistical Learning with R” (ISLR) was to provide a resource for this group. The enthusiasm with which it was received demonstrates the demand that exists within the community.


“The Hundred-Page Machine Learning Book” follows a similar paradigm. As with ISLR, it skips involved theoretical derivations in favor of providing the reader with key details on how to implement the various approaches. This is a compact “how to do data science” manual, and I predict it will become a go-to resource for academics and practitioners alike. At 100 pages (or a little more), the book is short enough to read in a single sitting. Yet, despite its brevity, it covers all the major machine learning approaches, ranging from classical linear and logistic regression, through to modern support vector machines, deep learning, boosting, and random forests. There is also no shortage of details on the various approaches, and the interested reader can gain further information on any particular method via the innovative companion book wiki. The book does not assume any high-level mathematical or statistical training, or even programming experience, so it should be accessible to almost anyone willing to invest the time to learn about these methods. It should certainly be required reading for anyone starting a PhD program in this area and will serve as a useful reference as they progress further. Finally, the book illustrates some of the algorithms using Python code, one of the most popular coding languages for machine learning. I would highly recommend “The Hundred-Page Machine Learning Book” both to the beginner looking to learn more about machine learning and to the experienced practitioner seeking to extend their knowledge base.


Gareth James, Professor of Data Sciences and Operations at the University of Southern California, co-author (with Witten, Hastie, and Tibshirani) of the best-selling book An Introduction to Statistical Learning, with Applications in R


Preface


Let’s start by telling the truth: machines don’t learn. What a typical “learning machine” does is find a mathematical formula which, when applied to a collection of inputs (called “training data”), produces the desired outputs. This mathematical formula also generates the correct outputs for most other inputs (distinct from the training data), on the condition that those inputs come from the same or a similar statistical distribution as the one the training data was drawn from.


Why isn’t that learning? Because if you slightly distort the inputs, the output is very likely to become completely wrong. That’s not how learning in animals works. If you learned to play a video game by looking straight at the screen, you would still be a good player if someone rotated the screen slightly. A machine learning algorithm trained by “looking” straight at the screen, unless it was also trained to recognize rotation, will fail to play the game on a rotated screen.


So why the name “machine learning” then? The reason, as is often the case, is marketing: Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term in 1959 while at IBM. Similarly to how in the 2010s IBM tried to market the term “cognitive computing” to stand out from competition, in the 1960s, IBM used the new cool term “machine learning” to attract both clients and talented employees.


As you can see, just like artificial intelligence is not intelligence, machine learning is not learning. However, machine learning is a universally recognized term that usually refers to the science and engineering of building machines capable of doing various useful things without being explicitly programmed to do so. So, the word “learning” in the term is used by analogy with the learning in animals rather than literally.


Who This Book is For


This book contains only those parts of the vast body of material on machine learning developed since the 1960s that have proven to have a significant practical value. A beginner in machine learning will find in this book just enough details to get a comfortable level of understanding of the field and start asking the right questions.


Practitioners with experience can use this book as a collection of directions for further self-improvement. The book also comes in handy when brainstorming at the beginning of a project, when you try to answer the question whether a given technical or business problem is “machine-learnable” and, if yes, which techniques you should try to solve it.


How to Use This Book


If you are about to start learning machine learning, you should read this book from the beginning to the end. (It’s just a hundred pages, not a big deal.) If you are interested in a specific topic covered in the book and want to know more, most sections have a QR code.


By scanning one of those QR codes with your phone, you will get a link to a page on the book’s companion wiki theMLbook.com with additional materials: recommended reads, videos, Q&As, code snippets, tutorials, and other bonuses. The book’s wiki is continuously updated with contributions from the book’s author himself as well as volunteers from all over the world. So this book, like a good wine, keeps getting better after you buy it.


Scan the QR code below to get to the book’s wiki:


Some sections don’t have a QR code, but they still most likely have a wiki page. You can find it by submitting the section’s title to the wiki’s search engine.


Should You Buy This Book?


This book is distributed on the “read first, buy later” principle. I firmly believe that paying for the content before consuming it is buying a pig in a poke. You can see and try a car in a dealership before you buy it. You can try on a shirt or a dress in a department store. You have to be able to read a book before paying for it.


The read first, buy later principle implies that you can freely download the book, read it, and share it with your friends and colleagues. Only if you read and liked the book, or found it helpful or useful in any way, do you have to buy it.


Now you are all set. Enjoy your reading!


1 Introduction


1.1 What is Machine Learning


Machine learning is a subfield of computer science that is concerned with building algorithms which, to be useful, rely on a collection of examples of some phenomenon. These examples can come from nature, be handcrafted by humans or generated by another algorithm.


Machine learning can also be defined as the process of solving a practical problem by 1) gathering a dataset, and 2) algorithmically building a statistical model based on that dataset. That statistical model is assumed to be used somehow to solve the practical problem.


To save keystrokes, I use the terms “learning” and “machine learning” interchangeably.


1.2 Types of Learning


Learning can be supervised, semi-supervised, unsupervised and reinforcement.


1.2.1 Supervised Learning


In supervised learning¹, the dataset is the collection of labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$. Each element $\mathbf{x}_i$ among $N$ is called a feature vector. A feature vector is a vector in which each dimension $j = 1, \ldots, D$ contains a value that describes the example somehow. That value is called a feature and is denoted as $x^{(j)}$. For instance, if each example $\mathbf{x}$ in our collection represents a person, then the first feature, $x^{(1)}$, could contain height in cm, the second feature, $x^{(2)}$, could contain weight in kg, $x^{(3)}$ could contain gender, and so on. For all examples in the dataset, the feature at position $j$ in the feature vector always contains the same kind of information. It means that if $x_i^{(2)}$ contains weight in kg in some example $\mathbf{x}_i$, then $x_k^{(2)}$ will also contain weight in kg in every example $\mathbf{x}_k$, $k = 1, \ldots, N$. The label $y_i$ can be either an element belonging to a finite set of classes $\{1, 2, \ldots, C\}$, or a real number, or a more complex structure, like a vector, a matrix, a tree, or a graph. Unless otherwise stated, in this book $y_i$ is either one of a finite set of classes or a real number². You can see a class as a category to which an example belongs. For instance, if your examples are email messages and your problem is spam detection, then you have two classes $\{spam, not\_spam\}$.


The goal of a supervised learning algorithm is to use the dataset to produce a model that takes a feature vector $\mathbf{x}$ as input and outputs information that allows deducing the label for this feature vector. For instance, the model created using the dataset of people could take as input a feature vector describing a person and output a probability that the person has cancer.


1.2.2 Unsupervised Learning


In unsupervised learning, the dataset is a collection of unlabeled examples $\{\mathbf{x}_i\}_{i=1}^N$. Again, $\mathbf{x}$ is a feature vector, and the goal of an unsupervised learning algorithm is to create a model that takes a feature vector $\mathbf{x}$ as input and either transforms it into another vector or into a value that can be used to solve a practical problem. For example, in clustering, the model returns the id of the cluster for each feature vector in the dataset. In dimensionality reduction, the output of the model is a feature vector that has fewer features than the input $\mathbf{x}$; in outlier detection, the output is a real number that indicates how $\mathbf{x}$ is different from a “typical” example in the dataset.


1.2.3 Semi-Supervised Learning


In semi-supervised learning, the dataset contains both labeled and unlabeled examples. Usually, the quantity of unlabeled examples is much higher than the number of labeled examples. The goal of a semi-supervised learning algorithm is the same as the goal of the supervised learning algorithm. The hope here is that using many unlabeled examples can help the learning algorithm to find (we might say “produce” or “compute”) a better model.


It could look counter-intuitive that learning could benefit from adding more unlabeled examples. It seems like we add more uncertainty to the problem. However, when you add unlabeled examples, you add more information about your problem: a larger sample better reflects the probability distribution the labeled data came from. Theoretically, a learning algorithm should be able to leverage this additional information.


1.2.4 Reinforcement Learning


Reinforcement learning is a subfield of machine learning where the machine “lives” in an environment and is capable of perceiving the state of that environment as a vector of features. The machine can execute actions in every state. Different actions bring different rewards and could also move the machine to another state of the environment. The goal of a reinforcement learning algorithm is to learn a policy.


A policy is a function (similar to the model in supervised learning) that takes the feature vector of a state as input and outputs an optimal action to execute in that state. The action is optimal if it maximizes the expected average reward.


Reinforcement learning solves a particular kind of problem where decision making is sequential, and the goal is long-term, such as game playing, robotics, resource management, or logistics. In this book, I put emphasis on one-shot decision making where input examples are independent of one another and the predictions made in the past. I leave reinforcement learning out of the scope of this book.


1.3 How Supervised Learning Works


In this section, I briefly explain how supervised learning works so that you have the picture of the whole process before we go into detail. I decided to use supervised learning as an example because it’s the type of machine learning most frequently used in practice.


The supervised learning process starts with gathering the data. The data for supervised learning is a collection of pairs (input, output). Input could be anything, for example, email messages, pictures, or sensor measurements. Outputs are usually real numbers, or labels (e.g. “spam”, “not_spam”, “cat”, “dog”, “mouse”, etc). In some cases, outputs are vectors (e.g., four coordinates of the rectangle around a person on the picture), sequences (e.g. [“adjective”, “adjective”, “noun”] for the input “big beautiful car”), or have some other structure.


Let’s say the problem that you want to solve using supervised learning is spam detection. You gather the data, for example, 10,000 email messages, each with a label either “spam” or “not_spam” (you could add those labels manually or pay someone to do that for you). Now, you have to convert each email message into a feature vector.


The data analyst decides, based on their experience, how to convert a real-world entity, such as an email message, into a feature vector. One common way to convert a text into a feature vector, called bag of words, is to take a dictionary of English words (let’s say it contains 20,000 alphabetically sorted words) and stipulate that in our feature vector:

  • the first feature is equal to $1$ if the email message contains the word “a”; otherwise, this feature is $0$;
  • the second feature is equal to $1$ if the email message contains the word “aaron”; otherwise, this feature equals $0$;
  • …
  • the feature at position 20,000 is equal to $1$ if the email message contains the word “zulu”; otherwise, this feature is equal to $0$.


You repeat the above procedure for every email message in our collection, which gives us 10,000 feature vectors (each vector having the dimensionality of 20,000) and a label (“spam”/“not_spam”).
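As a rough illustration, the binary encoding above can be sketched in a few lines of Python. The five-word dictionary and the example message are invented for this sketch; the book assumes a 20,000-word, alphabetically sorted English dictionary.

```python
# A minimal sketch of the binary bag-of-words encoding described above.
# The tiny dictionary below is a made-up stand-in for the book's
# 20,000-word English dictionary.
dictionary = sorted(["a", "buy", "meeting", "now", "zulu"])

def bag_of_words(message, dictionary):
    """Return a binary feature vector: 1 if the word occurs in the message, else 0."""
    words = set(message.lower().split())
    return [1 if word in words else 0 for word in dictionary]

message = "Buy now"
x = bag_of_words(message, dictionary)
# One feature per dictionary word, in alphabetical order:
# "a" -> 0, "buy" -> 1, "meeting" -> 0, "now" -> 1, "zulu" -> 0
```

Applying the same function to each of the 10,000 messages would yield 10,000 such vectors, each of dimensionality equal to the dictionary size.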


Now you have machine-readable input data, but the output labels are still in the form of human-readable text. Some learning algorithms require transforming labels into numbers. For example, some algorithms require numbers like $0$ (to represent the label “not_spam”) and $1$ (to represent the label “spam”). The algorithm I use to illustrate supervised learning is called Support Vector Machine (SVM). This algorithm requires that the positive label (in our case it’s “spam”) has the numeric value of $+1$ (one), and the negative label (“not_spam”) has the value of $-1$ (minus one).


At this point, you have a dataset and a learning algorithm, so you are ready to apply the learning algorithm to the dataset to get the model.


SVM sees every feature vector as a point in a high-dimensional space (in our case, the space is 20,000-dimensional). The algorithm puts all feature vectors on an imaginary 20,000-dimensional plot and draws an imaginary 19,999-dimensional line (a hyperplane) that separates examples with positive labels from examples with negative labels. In machine learning, the boundary separating the examples of different classes is called the decision boundary.


The equation of the hyperplane is given by two parameters, a real-valued vector $\mathbf{w}$ of the same dimensionality as our input feature vector $\mathbf{x}$, and a real number $b$ like this:

$$\mathbf{w}\mathbf{x} - b = 0,$$


where the expression $\mathbf{w}\mathbf{x}$ means $w^{(1)}x^{(1)} + w^{(2)}x^{(2)} + \ldots + w^{(D)}x^{(D)}$, and $D$ is the number of dimensions of the feature vector $\mathbf{x}$.


(If some equations aren’t clear to you right now, in Chapter 2 we revisit the math and statistical concepts necessary to understand them. For the moment, try to get an intuition of what’s happening here. It all becomes more clear after you read the next chapter.)


Now, the predicted label for some input feature vector $\mathbf{x}$ is given like this:

$$y = \operatorname{sign}(\mathbf{w}\mathbf{x} - b),$$


where $\operatorname{sign}$ is a mathematical operator that takes any value as input and returns $+1$ if the input is a positive number or $-1$ if the input is a negative number.


The goal of the learning algorithm — SVM in this case — is to leverage the dataset and find the optimal values $\mathbf{w}^*$ and $b^*$ for parameters $\mathbf{w}$ and $b$. Once the learning algorithm identifies these optimal values, the model $f(\mathbf{x})$ is then defined as:

$$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^*\mathbf{x} - b^*)$$


Therefore, to predict whether an email message is spam or not spam using an SVM model, you have to take the text of the message, convert it into a feature vector, then multiply this vector by $\mathbf{w}^*$, subtract $b^*$, and take the sign of the result. This will give us the prediction ($+1$ means “spam”, $-1$ means “not_spam”).
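The prediction step can be sketched as follows. The weights, bias, and feature vector below are invented for illustration; in practice the optimal $\mathbf{w}^*$ and $b^*$ come out of training.

```python
# A minimal sketch of SVM prediction: y = sign(w*x - b).
# The values of w_star, b_star, and x below are made up; in practice
# w* and b* are found by the learning algorithm.
def predict(w, x, b):
    """Return +1 ("spam") or -1 ("not_spam") for a feature vector x."""
    score = sum(wj * xj for wj, xj in zip(w, x)) - b
    # Treat a score of exactly zero as the negative class in this sketch.
    return 1 if score > 0 else -1

w_star = [0.5, -1.0, 2.0]   # hypothetical learned weights
b_star = 0.25               # hypothetical learned bias
x = [1, 0, 1]               # hypothetical feature vector of a message
label = predict(w_star, x, b_star)   # score = 0.5 + 2.0 - 0.25 = 2.25 > 0, so +1
```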


Now, how does the machine find $\mathbf{w}^*$ and $b^*$? It solves an optimization problem. Machines are good at optimizing functions under constraints.


So what are the constraints we want to satisfy here? First of all, we want the model to predict the labels of our 10,000 examples correctly. Remember that each example $i = 1, \ldots, 10000$ is given by a pair $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is the feature vector of example $i$ and $y_i$ is its label that takes values either $-1$ or $+1$. So the constraints are naturally:

$$\begin{aligned} &\mathbf{w}\mathbf{x}_i - b \geq +1 && \text{if}\ y_i = +1, \\ &\mathbf{w}\mathbf{x}_i - b \leq -1 && \text{if}\ y_i = -1. \end{aligned}$$


We would also prefer that the hyperplane separates positive examples from negative ones with the largest margin. The margin is the distance between the closest examples of two classes, as defined by the decision boundary. A large margin contributes to a better generalization, that is, how well the model will classify new examples in the future. To achieve that, we need to minimize the Euclidean norm of $\mathbf{w}$, denoted by $\|\mathbf{w}\|$ and given by $\sqrt{\sum_{j=1}^D (w^{(j)})^2}$.


So, the optimization problem that we want the machine to solve looks like this:


Minimize $\|\mathbf{w}\|$ subject to $y_i(\mathbf{w}\mathbf{x}_i - b) \geq 1$ for $i = 1, \ldots, N$. The expression $y_i(\mathbf{w}\mathbf{x}_i - b) \geq 1$ is just a compact way to write the above two constraints.


The solution of this optimization problem, given by $\mathbf{w}^*$ and $b^*$, is called the statistical model, or, simply, the model. The process of building the model is called training.

Figure 1: An example of an SVM model for two-dimensional feature vectors.


For two-dimensional feature vectors, the problem and the solution can be visualized as shown in fig. 1. The blue and orange circles represent, respectively, positive and negative examples, and the line given by $\mathbf{w}\mathbf{x} - b = 0$ is the decision boundary.


Why, by minimizing the norm of $\mathbf{w}$, do we find the highest margin between the two classes? Geometrically, the equations $\mathbf{w}\mathbf{x} - b = 1$ and $\mathbf{w}\mathbf{x} - b = -1$ define two parallel hyperplanes, as you see in fig. 1. The distance between these hyperplanes is given by $\frac{2}{\|\mathbf{w}\|}$, so the smaller the norm $\|\mathbf{w}\|$, the larger the distance between these two hyperplanes.
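To make the margin formula concrete, here is a small sketch that checks the constraint y_i(wx_i − b) ≥ 1 on a made-up two-dimensional dataset and computes the distance 2/‖w‖ between the two hyperplanes. The weights, bias, and examples are invented for illustration, not the result of actual training.

```python
import math

# A small sketch checking the hard-margin constraints y_i(w*x_i - b) >= 1
# and computing the margin 2/||w|| for a made-up 2D "solution" (w, b).
w = [1.0, 1.0]
b = 3.0
examples = [([1.0, 1.0], -1), ([2.0, 4.0], 1), ([4.0, 1.0], 1), ([0.5, 1.5], -1)]

def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Every training example must satisfy the constraint:
ok = all(y * (dot(w, x) - b) >= 1 for x, y in examples)

norm_w = math.sqrt(sum(wj ** 2 for wj in w))  # Euclidean norm ||w||
margin = 2 / norm_w                           # distance between the two hyperplanes
```

Here ‖w‖ = √2, so the margin is 2/√2 ≈ 1.41; shrinking the norm of w (while keeping the constraints satisfied) would widen this gap.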


That’s how Support Vector Machines work. This particular version of the algorithm builds the so-called linear model. It’s called linear because the decision boundary is a straight line (or a plane, or a hyperplane). SVM can also incorporate kernels that can make the decision boundary arbitrarily non-linear. In some cases, it could be impossible to perfectly separate the two groups of points because of noise in the data, errors of labeling, or outliers (examples very different from a “typical” example in the dataset). Another version of SVM can also incorporate a penalty hyperparameter³ for misclassification of training examples of specific classes. We study the SVM algorithm in more detail in Chapter 3.


At this point, you should retain the following: any classification learning algorithm that builds a model implicitly or explicitly creates a decision boundary. The decision boundary can be straight, or curved, or it can have a complex form, or it can be a superposition of some geometrical figures. The form of the decision boundary determines the accuracy of the model (that is the ratio of examples whose labels are predicted correctly). The form of the decision boundary, the way it is algorithmically or mathematically computed based on the training data, differentiates one learning algorithm from another.


In practice, there are two other essential differentiators of learning algorithms to consider: speed of model building and prediction processing time. In many practical cases, you would prefer a learning algorithm that builds a less accurate model quickly. Additionally, you might prefer a less accurate model that is much quicker at making predictions.


1.4 Why the Model Works on New Data


Why is a machine-learned model capable of predicting correctly the labels of new, previously unseen examples? To understand that, look at the plot in fig. 1. If two classes are separable from one another by a decision boundary, then, obviously, examples that belong to each class are located in two different subspaces which the decision boundary creates.


If the examples used for training were selected randomly, independently of one another, and following the same procedure, then, statistically, it is more likely that the new negative example will be located on the plot somewhere not too far from other negative examples. The same concerns the new positive example: it will likely come from the surroundings of other positive examples. In such a case, our decision boundary will still, with high probability, separate well new positive and negative examples from one another. For other, less likely situations, our model will make errors, but because such situations are less likely, the number of errors will likely be smaller than the number of correct predictions.

直观上,训练示例集越大,新示例与用于训练的示例不同(并且在图上相距甚远)的可能性就越小。

Intuitively, the larger the set of training examples, the less likely it is that the new examples will be dissimilar to (and lie on the plot far from) the examples used for training.

为了最大限度地减少在新示例上出错的概率,SVM算法通过寻找最大间隔,显式地尝试使决策边界尽可能远离两类示例。

To minimize the probability of making errors on new examples, the SVM algorithm, by looking for the largest margin, explicitly tries to draw the decision boundary in such a way that it lies as far as possible from examples of both classes.

有兴趣了解更多关于可学习性并了解模型误差、训练集大小、定义模型的数学方程形式以及构建模型所需时间之间密切关系的读者,鼓励阅读关于PAC学习。 PAC(“可能近似正确”)学习理论有助于分析学习算法是否以及在什么条件下可能输出近似正确的分类器。

The reader interested in knowing more about learnability and understanding the close relationship between the model error, the size of the training set, the form of the mathematical equation that defines the model, and the time it takes to build the model is encouraged to read about PAC learning. The PAC (for "probably approximately correct") learning theory helps to analyze whether and under what conditions a learning algorithm will probably output an approximately correct classifier.


  1. 如果某个表达以粗体显示,则表示这是一个科学领域的技术术语。如果你在书中再次遇到这个术语,它的含义将完全相同。

  2. If an expression is in bold, that means that this is a technical term of the scientific jargon. If you meet it once again in the book, the term will have exactly the same meaning.

  3. 实数是可以表示沿直线的距离的量。例子:0、-256.34、1000、1000.2。

  4. A real number is a quantity that can represent a distance along a line. Examples: 0, -256.34, 1000, 1000.2.

  5. 超参数是学习算法的一个属性,通常(但不总是)具有数值。该值影响算法的工作方式。这些值不是由算法本身从数据中学习的。它们必须由数据分析师在运行算法之前设置。

  6. A hyperparameter is a property of a learning algorithm, usually (but not always) having a numerical value. That value influences the way the algorithm works. Those values aren’t learned by the algorithm itself from data. They have to be set by the data analyst before running the algorithm.

2符号和定义

2 Notation and Definitions

2.1符号

2.1 Notation

让我们首先回顾一下我们在学校学到的数学符号,但有些人可能在舞会结束后就忘记了。

Let’s start by revisiting the mathematical notation we all learned at school, but some likely forgot right after the prom.

2.1.1数据结构

2.1.1 Data Structures

标量是一个简单的数值,例如15或者-3.25。取标量值的变量或常量用斜体字母表示,例如x或者a。

A scalar is a simple numerical value, like 15 or -3.25. Variables or constants that take scalar values are denoted by an italic letter, like x or a.

向量是标量值的有序列表,这些值称为属性。我们用粗体字符表示向量,例如\mathbf{x}或者\mathbf{w}。向量既可以可视化为指向某个方向的箭头,也可以看作多维空间中的点。三个二维向量\mathbf{a}=[2,3]、\mathbf{b}=[-2,5]和\mathbf{c}=[1,0]的插图见图2和图3。我们将向量的属性表示为带索引的斜体值,例如w^{(j)}或x^{(j)}。索引j表示向量的特定维度,即属性在列表中的位置。例如,在图2和图3中以红色显示的向量\mathbf{a}中,a^{(1)} = 2且a^{(2)} = 3。

A vector is an ordered list of scalar values, called attributes. We denote a vector as a bold character, for example, \mathbf{x} or \mathbf{w}. Vectors can be visualized as arrows that point to some directions, as well as points in a multi-dimensional space. Illustrations of three two-dimensional vectors, \mathbf{a}=[2,3], \mathbf{b}=[-2,5], and \mathbf{c}=[1,0], are given in fig. 2 and fig. 3. We denote an attribute of a vector as an italic value with an index, like this: w^{(j)} or x^{(j)}. The index j denotes a specific dimension of the vector, the position of an attribute in the list. For instance, in the vector \mathbf{a} shown in red in fig. 2 and fig. 3, a^{(1)} = 2 and a^{(2)} = 3.

符号x^{(j)}不应与幂运算符混淆,例如x^2中的2(平方)或x^3中的3(立方)。如果我们想对向量的索引属性应用幂运算符(例如平方),可以这样写:(x^{(j)})^2。

The notation x^{(j)} should not be confused with the power operator, such as the 2 in x^2 (squared) or the 3 in x^3 (cubed). If we want to apply a power operator, say a square, to an indexed attribute of a vector, we write it like this: (x^{(j)})^2.

一个变量可以有两个或更多索引,例如x_i^{(j)}或者x_{i,j}^{(k)}。例如,在神经网络中,我们用x_{l,u}^{(j)}表示层l中单元u的输入特征j。

A variable can have two or more indices, like this: x_i^{(j)} or like this: x_{i,j}^{(k)}. For example, in neural networks, we denote as x_{l,u}^{(j)} the input feature j of unit u in layer l.

矩阵是按行和列排列数字的矩形阵列。下面是一个两行三列的矩阵示例,

A matrix is a rectangular array of numbers arranged in rows and columns. Below is an example of a matrix with two rows and three columns,

\begin{bmatrix}2&4&-3\\21&-6&-1\end{bmatrix}.

矩阵用粗体大写字母表示,例如\mathbf{A}或者\mathbf{W}。

Matrices are denoted with bold capital letters, such as \mathbf{A} or \mathbf{W}.

图 2:三个向量可视化为方向。
图 3:三个向量可视化为点。

集合唯一元素的无序集合。我们将集合表示为书法大写字符,例如,𝒮\数学{S}。一组数字可以是有限的(包括固定数量的值)。在这种情况下,它使用荣誉来表示,例如,{1,3,18,23,235}\{1,3,18,23,235\}或者{X1,X2,X3,X4,……,Xn}\{x_1,x_2,x_3,x_4,\l点,x_n\}。集合可以是无限的,并且包括某个区间内的所有值。如果一个集合包含之间的所有值AA, 包括AA,用括号表示为[A,][一,二]。如果集合不包含值AA,这样的集合使用括号表示,如下所示:A,(一、二)。例如,集合[0,1][0,1]包括这样的值00,0.00010.0001,0.250.25,0.7840.784,0.99950.9995, 和1.01.0。一个特殊的集合表示\mathbb{R}包括从负无穷大到正无穷大的所有数字。

A set is an unordered collection of unique elements. We denote a set as a calligraphic capital character, for example, 𝒮\mathcal{S}. A set of numbers can be finite (include a fixed amount of values). In this case, it is denoted using accolades, for example, {1,3,18,23,235}\{1,3,18,23,235\} or {x1,x2,x3,x4,,xn}\{x_1,x_2,x_3,x_4,\ldots,x_n\}. A set can be infinite and include all values in some interval. If a set includes all values between aa and bb, including aa and bb, it is denoted using brackets as [a,b][a,b]. If the set doesn’t include the values aa and bb, such a set is denoted using parentheses like this: (a,b)(a,b). For example, the set [0,1][0,1] includes such values as 00, 0.00010.0001, 0.250.25, 0.7840.784, 0.99950.9995, and 1.01.0. A special set denoted \mathbb{R} includes all numbers from minus infinity to plus infinity.

当一个元素XX属于一个集合𝒮\数学{S}, 我们写Xε𝒮x \in \mathcal{S}。我们可以获得一套新的𝒮3\mathcal{S}_3作为两个集合的交集𝒮1\mathcal{S}_1𝒮2\mathcal{S}_2。在这种情况下,我们写𝒮3𝒮1𝒮2\mathcal{S}_3 \leftarrow \mathcal{S}_1 \cap \mathcal{S}_2。例如{1,3,5,8}{1,8,4}\{1,3,5,8\} \cap \{1,8,4\}给出新的集合{1,8}\{1,8\}

When an element xx belongs to a set 𝒮\mathcal{S}, we write x𝒮x \in \mathcal{S}. We can obtain a new set 𝒮3\mathcal{S}_3 as an intersection of two sets 𝒮1\mathcal{S}_1 and 𝒮2\mathcal{S}_2. In this case, we write 𝒮3𝒮1𝒮2\mathcal{S}_3 \leftarrow \mathcal{S}_1 \cap \mathcal{S}_2. For example {1,3,5,8}{1,8,4}\{1,3,5,8\} \cap \{1,8,4\} gives the new set {1,8}\{1,8\}.

我们可以获得一套新的𝒮3\mathcal{S}_3作为两个集合的并集𝒮1\mathcal{S}_1𝒮2\mathcal{S}_2。在这种情况下,我们写𝒮3𝒮1𝒮2\mathcal{S}_3 \leftarrow \mathcal{S}_1 \cup \mathcal{S}_2。例如{1,3,5,8}{1,8,4}\{1,3,5,8\} \cup \{1,8,4\}给出新的集合{1,3,4,5,8}\{1,3,4,5,8\}

We can obtain a new set 𝒮3\mathcal{S}_3 as a union of two sets 𝒮1\mathcal{S}_1 and 𝒮2\mathcal{S}_2. In this case, we write 𝒮3𝒮1𝒮2\mathcal{S}_3 \leftarrow \mathcal{S}_1 \cup \mathcal{S}_2. For example {1,3,5,8}{1,8,4}\{1,3,5,8\} \cup \{1,8,4\} gives the new set {1,3,4,5,8}\{1,3,4,5,8\}.
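These set operations map directly onto Python's built-in set type; a brief illustrative sketch (not part of the original text) checking the two examples above:

```python
# Sets in Python, like the mathematical sets above, are unordered
# collections of unique elements.
S1 = {1, 3, 5, 8}
S2 = {1, 8, 4}

intersection = S1 & S2  # elements belonging to both sets
union = S1 | S2         # elements belonging to at least one set
```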

2.1.2大写西格玛符号

2.1.2 Capital Sigma Notation

对集合\mathit{X} = \{x_1, x_2, \ldots, x_{n-1}, x_n\}求和,或对向量\mathbf{x} = [x^{(1)}, x^{(2)}, \ldots, x^{(m-1)}, x^{(m)}]的属性求和,表示如下:

The summation over a collection \mathit{X} = \{x_1, x_2, \ldots, x_{n-1}, x_n\} or over the attributes of a vector \mathbf{x} = [x^{(1)}, x^{(2)}, \ldots, x^{(m-1)}, x^{(m)}] is denoted like this:

\sum_{i=1}^n x_i \stackrel{\text{def}}{=} x_1 + x_2 + \ldots + x_{n-1} + x_n,

或者:

or else:

\sum_{j=1}^m x^{(j)} \stackrel{\text{def}}{=} x^{(1)} + x^{(2)} + \ldots + x^{(m-1)} + x^{(m)}.

符号\stackrel{\text{def}}{=}的意思是"被定义为"。

The notation \stackrel{\text{def}}{=} means "is defined as".

2.1.3大写 Pi 表示法

2.1.3 Capital Pi Notation

类似于大写 sigma 的符号是大写 pi 符号。它表示集合中元素或向量属性的乘积:

A notation analogous to capital sigma is the capital pi notation. It denotes a product of elements in a collection or attributes of a vector:

\prod_{i=1}^n x_i \stackrel{\text{def}}{=} x_1 \cdot x_2 \cdot \ldots \cdot x_{n-1} \cdot x_n,

其中a \cdot b表示a乘以b。在可能的情况下,我们省略\cdot以简化符号,因此ab也表示a乘以b。

where a \cdot b means a multiplied by b. Where possible, we omit \cdot to simplify the notation, so ab also means a multiplied by b.
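As a quick illustrative sketch, both notations correspond to one-line computations in Python (`math.prod` is the product analogue of the built-in `sum`):

```python
import math

x = [2.0, 3.0, 4.0]

sigma = sum(x)      # capital sigma: x_1 + x_2 + x_3
pi = math.prod(x)   # capital pi: x_1 * x_2 * x_3
```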

2.1.4集合运算

2.1.4 Operations on Sets

派生集合的创建运算符如下所示:\mathcal{S}' \leftarrow \{x^2 \mid x \in \mathcal{S}, x > 3\}。这个符号意味着我们创建一个新集合\mathcal{S}':把所有满足x属于\mathcal{S}且x大于3的x的平方放入其中。

A derived set creation operator looks like this: \mathcal{S}' \leftarrow \{x^2 \mid x \in \mathcal{S}, x > 3\}. This notation means that we create a new set \mathcal{S}' by putting into it x squared such that x is in \mathcal{S} and x is greater than 3.

基数运算符|\mathcal{S}|返回集合\mathcal{S}中元素的数量。

The cardinality operator |\mathcal{S}| returns the number of elements in the set \mathcal{S}.
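Python's set comprehensions and `len` mirror these two operators almost symbol for symbol; a small sketch with an assumed example set:

```python
S = {1, 2, 3, 4, 5}

# S' <- { x^2 | x in S, x > 3 }
S_prime = {x ** 2 for x in S if x > 3}

# |S|: the number of elements in S
cardinality = len(S)
```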

2.1.5向量运算

2.1.5 Operations on Vectors

两个向量的和\mathbf{x} + \mathbf{z}定义为向量[x^{(1)} + z^{(1)}, x^{(2)} + z^{(2)}, \ldots, x^{(m)} + z^{(m)}]。

The sum of two vectors \mathbf{x} + \mathbf{z} is defined as the vector [x^{(1)} + z^{(1)}, x^{(2)} + z^{(2)}, \ldots, x^{(m)} + z^{(m)}].

两个向量的差\mathbf{x} - \mathbf{z}定义为[x^{(1)} - z^{(1)}, x^{(2)} - z^{(2)}, \ldots, x^{(m)} - z^{(m)}]。

The difference of two vectors \mathbf{x} - \mathbf{z} is defined as [x^{(1)} - z^{(1)}, x^{(2)} - z^{(2)}, \ldots, x^{(m)} - z^{(m)}].

向量乘以标量仍是向量。例如\mathbf{x}c \stackrel{\text{def}}{=} [cx^{(1)}, cx^{(2)},\ldots,cx^{(m)}]。

A vector multiplied by a scalar is a vector. For example, \mathbf{x}c \stackrel{\text{def}}{=} [cx^{(1)}, cx^{(2)},\ldots,cx^{(m)}].

两个向量的点积是标量。例如,\mathbf{w}\mathbf{x} \stackrel{\text{def}}{=} \sum_{i=1}^m w^{(i)}x^{(i)}。在一些书中,点积表示为\mathbf{w}\cdot\mathbf{x}。两个向量必须具有相同的维度,否则点积没有定义。

A dot-product of two vectors is a scalar. For example, \mathbf{w}\mathbf{x} \stackrel{\text{def}}{=} \sum_{i=1}^m w^{(i)}x^{(i)}. In some books, the dot-product is denoted as \mathbf{w}\cdot\mathbf{x}. The two vectors must be of the same dimensionality; otherwise, the dot-product is undefined.
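A minimal sketch of the dot-product definition, including the dimensionality check, using plain Python lists (the numbers are made up for illustration):

```python
w = [0.5, -1.0, 2.0]
x = [4.0, 1.0, 0.5]

# The two vectors must have the same dimensionality.
assert len(w) == len(x)

# wx = sum over i of w^(i) * x^(i)
dot_product = sum(w_i * x_i for w_i, x_i in zip(w, x))
```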

矩阵\mathbf{W}乘以向量\mathbf{x}的结果是另一个向量。设我们的矩阵为,

The multiplication of a matrix \mathbf{W} by a vector \mathbf{x} results in another vector. Let our matrix be,

\mathbf{W} = \begin{bmatrix} w^{(1,1)} & w^{(1,2)} & w^{(1,3)} \\ w^{(2,1)} & w^{(2,2)} & w^{(2,3)} \end{bmatrix}.

当向量参与矩阵运算时,向量默认表示为只有一列的矩阵。当向量位于矩阵右侧时,它保持为列向量。仅当向量的行数与矩阵的列数相同时,我们才能用矩阵乘以该向量。设我们的向量为\mathbf{x} \stackrel{\text{def}}{=} [x^{(1)},x^{(2)},x^{(3)}]。那么\mathbf{W}\mathbf{x}是一个二维向量,定义为,

When vectors participate in operations on matrices, a vector is by default represented as a matrix with one column. When the vector is on the right of the matrix, it remains a column vector. We can only multiply a matrix by a vector if the vector has the same number of rows as the number of columns in the matrix. Let our vector be \mathbf{x} \stackrel{\text{def}}{=} [x^{(1)},x^{(2)},x^{(3)}]. Then \mathbf{W}\mathbf{x} is a two-dimensional vector defined as,

\begin{align*} \mathbf{W}\mathbf{x} &= \begin{bmatrix} w^{(1,1)} & w^{(1,2)} & w^{(1,3)} \\ w^{(2,1)} & w^{(2,2)} & w^{(2,3)} \end{bmatrix} \begin{bmatrix} x^{(1)}\\ x^{(2)}\\ x^{(3)} \end{bmatrix}\\ &\stackrel{\text{def}}{=} \begin{bmatrix} w^{(1,1)}x^{(1)} + w^{(1,2)}x^{(2)} + w^{(1,3)}x^{(3)}\\ w^{(2,1)}x^{(1)} + w^{(2,2)}x^{(2)} + w^{(2,3)}x^{(3)} \end{bmatrix}\\ &= \begin{bmatrix} \mathbf{w}^{(1)}\mathbf{x}\\ \mathbf{w}^{(2)}\mathbf{x} \end{bmatrix} \end{align*}

如果我们的矩阵有五行,那么乘积的结果将是一个五维向量。

If our matrix had, say, five rows, the result of the product would be a five-dimensional vector.
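The row-by-row definition above translates directly into code; a sketch with the example matrix and an assumed vector, where each output component is the dot product of one row of W with x:

```python
W = [[2.0, 4.0, -3.0],
     [21.0, -6.0, -1.0]]
x = [1.0, 2.0, 3.0]

# Wx has one component per row of W: the dot product of that row with x.
Wx = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
```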

当向量在乘法中位于矩阵的左侧时,必须先对它转置,再与矩阵相乘。向量\mathbf{x}的转置,表示为\mathbf{x}^\top,把列向量变成行向量。比方说,

When the vector is on the left side of the matrix in the multiplication, it has to be transposed before we multiply it by the matrix. The transpose of the vector \mathbf{x}, denoted as \mathbf{x}^\top, makes a row vector out of a column vector. Let's say,

\mathbf{x} = \begin{bmatrix} x^{(1)}\\ x^{(2)} \end{bmatrix},\, \, \textrm{then}\, \, \mathbf{x}^\top \stackrel{\text{def}}{=} \begin{bmatrix} x^{(1)} & x^{(2)} \end{bmatrix}.

向量\mathbf{x}乘以矩阵\mathbf{W}由\mathbf{x}^{\top}\mathbf{W}给出,

The multiplication of the vector \mathbf{x} by the matrix \mathbf{W} is given by \mathbf{x}^{\top}\mathbf{W},

\begin{align*} \mathbf{x}^{\top}\mathbf{W} &= \begin{bmatrix} x^{(1)} & x^{(2)} \end{bmatrix} \begin{bmatrix} w^{(1,1)} & w^{(1,2)} & w^{(1,3)} \\ w^{(2,1)} & w^{(2,2)} & w^{(2,3)} \end{bmatrix}\\ &\stackrel{\text{def}}{=} \big[ w^{(1,1)}x^{(1)} + w^{(2,1)}x^{(2)},\\ &\,\,\,\,\,\, w^{(1,2)}x^{(1)} + w^{(2,2)}x^{(2)}, \\ &\,\,\,\,\,\, w^{(1,3)}x^{(1)} + w^{(2,3)}x^{(2)}\big] \end{align*}

正如您所看到的,只有当向量的维数与矩阵的行数相同时,我们才能将向量乘以矩阵。

As you can see, we can only multiply a vector by a matrix if the vector has the same number of dimensions as the number of rows in the matrix.

2.1.6函数

2.1.6 Functions

函数是一种关系,它将集合\mathcal{X}(函数的定义域)中的每个元素x关联到另一个集合\mathcal{Y}(函数的到达域)中的单个元素y。函数通常有一个名字。如果函数叫作f,这个关系表示为y = f(x)(读作f of x),元素x是函数的自变量或输入,y是函数的值或输出。用于表示输入的符号是函数的变量(我们常说f是变量x的函数)。

A function is a relation that associates each element x of a set \mathcal{X}, the domain of the function, to a single element y of another set \mathcal{Y}, the codomain of the function. A function usually has a name. If the function is called f, this relation is denoted y = f(x) (read f of x), the element x is the argument or input of the function, and y is the value of the function or the output. The symbol that is used for representing the input is the variable of the function (we often say that f is a function of the variable x).

图 4:函数的局部和全局最小值。

我们说f(x)在x = c处有局部最小值,如果对于x = c附近某个开区间内的每一个x,都有f(x) \geq f(c)。区间是一个实数集合,其性质是:位于该集合中任意两个数之间的任何数也包含在该集合中。开区间不包括其端点,并用圆括号表示。例如,(0,1)意思是"所有大于0且小于1的数"。所有局部最小值中最小的那个称为全局最小值。参见图4中的插图。

We say that f(x) has a local minimum at x = c if f(x) \geq f(c) for every x in some open interval around x = c. An interval is a set of real numbers with the property that any number that lies between two numbers in the set is also included in the set. An open interval does not include its endpoints and is denoted using parentheses. For example, (0,1) means "all numbers greater than 0 and less than 1". The minimal value among all the local minima is called the global minimum. See the illustration in fig. 4.

向量函数,表示为\mathbf{y} = \bm{f}(x),是返回向量\mathbf{y}的函数。它的参数可以是向量,也可以是标量。

A vector function, denoted as \mathbf{y} = \bm{f}(x), is a function that returns a vector \mathbf{y}. It can have a vector or a scalar argument.

2.1.7 Max与Arg Max

2.1.7 Max and Arg Max

给定一个值的集合\mathcal{A} = \{a_1, a_2, \ldots, a_n\},运算符\max_{a\in\mathcal{A}} f(a)返回f(a)在集合\mathcal{A}所有元素上的最大值。另一方面,运算符\underset{a\in\mathcal{A}}{\mathrm{argmax}} f(a)返回集合\mathcal{A}中使f(a)最大化的那个元素。

Given a set of values \mathcal{A} = \{a_1, a_2,\ldots, a_n\}, the operator \max_{a\in\mathcal{A}} f(a) returns the highest value f(a) for all elements in the set \mathcal{A}. On the other hand, the operator \underset{a\in\mathcal{A}}{\mathrm{argmax}} f(a) returns the element of the set \mathcal{A} that maximizes f(a).

有时,当集合是隐式的或无限的时,我们可以写\max_{a} f(a)或者\underset{a}{\mathrm{argmax}} f(a)。

Sometimes, when the set is implicit or infinite, we can write \max_{a} f(a) or \underset{a}{\mathrm{argmax}} f(a).
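In Python, `max` with a `key` argument computes exactly the argmax over a finite set; a brief sketch with a made-up function f:

```python
A = [-3, 1, 4, -2]

def f(a):
    # An illustrative function peaking at a = 1.
    return -(a - 1) ** 2

max_value = max(f(a) for a in A)   # max over a in A of f(a)
arg_max = max(A, key=f)            # the element of A that maximizes f
```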

运算符\min和\mathrm{argmin}以类似的方式工作。

Operators \min and \mathrm{argmin} operate in a similar manner.

2.1.8赋值运算符

2.1.8 Assignment Operator

表达式a \gets f(x)意味着变量a获得新值:f(x)的结果。我们说变量a被赋予了一个新值。类似地,\mathbf{a} \gets [a_1, a_2]意味着向量变量\mathbf{a}获得二维向量值[a_1, a_2]。

The expression a \gets f(x) means that the variable a gets the new value: the result of f(x). We say that the variable a gets assigned a new value. Similarly, \mathbf{a} \gets [a_1, a_2] means that the vector variable \mathbf{a} gets the two-dimensional vector value [a_1, a_2].

2.1.9导数和梯度

2.1.9 Derivative and Gradient

函数f的导数f'是描述f增长(或减小)快慢的一个函数或一个值。如果导数是常数值,例如5或-3,则函数在其定义域的任意点x处以恒定速度增长(或减小)。如果导数f'是一个函数,则函数f在其定义域的不同区域可以以不同的速度增长。如果导数f'在某点x处为正,则函数f在该点增长;如果f的导数在某个x处为负,则函数在该点减小。导数在x处为零意味着函数在x处的斜率是水平的。

A derivative f' of a function f is a function or a value that describes how fast f grows (or decreases). If the derivative is a constant value, like 5 or -3, then the function grows (or decreases) constantly at any point x of its domain. If the derivative f' is a function, then the function f can grow at a different pace in different regions of its domain. If the derivative f' is positive at some point x, then the function f grows at this point. If the derivative of f is negative at some x, then the function decreases at this point. A derivative of zero at x means that the function's slope at x is horizontal.

寻找导数的过程称为微分

The process of finding a derivative is called differentiation.

基本函数的导数是已知的。例如,如果f(x) = x^2,则f'(x) = 2x;如果f(x) = 2x,则f'(x) = 2;如果f(x) = 2,则f'(x) = 0(任何函数f(x) = c的导数都为零,其中c是一个常数)。

Derivatives for basic functions are known. For example, if f(x) = x^2, then f'(x) = 2x; if f(x) = 2x, then f'(x) = 2; if f(x) = 2, then f'(x) = 0 (the derivative of any function f(x) = c, where c is a constant value, is zero).

如果我们想求导的函数不是基本函数,可以用链式法则求它的导数。例如,如果F(x) = f(g(x)),其中f和g是某些函数,则F'(x) = f'(g(x))g'(x)。例如,如果F(x) = (5x + 1)^2,则g(x) = 5x + 1且f(g(x)) = (g(x))^2。应用链式法则,我们得到:

If the function we want to differentiate is not basic, we can find its derivative using the chain rule. For instance, if F(x) = f(g(x)), where f and g are some functions, then F'(x) = f'(g(x))g'(x). For example, if F(x) = (5x + 1)^2, then g(x) = 5x + 1 and f(g(x)) = (g(x))^2. By applying the chain rule, we find:

\begin{aligned}F'(x) &= 2(5x + 1)g'(x) \\ &= 2(5x + 1)5 \\ &= 50x + 10.\end{aligned}
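The chain-rule result can be sanity-checked numerically with a finite-difference approximation; a small sketch (the evaluation point is arbitrary):

```python
def F(x):
    return (5 * x + 1) ** 2

def F_prime(x):
    return 50 * x + 10  # the derivative obtained via the chain rule

# Central finite difference: (F(x+h) - F(x-h)) / (2h) approximates F'(x).
h = 1e-6
x0 = 2.0
numeric_derivative = (F(x0 + h) - F(x0 - h)) / (2 * h)
```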

梯度是采用多个输入(或采用向量或某种其他复杂结构形式的一个输入)的函数导数的概括。函数的梯度是偏导数的向量。您可以将求函数的偏导数视为通过关注函数的输入之一并将所有其他输入视为常数值来求导数的过程。

Gradient is the generalization of derivative for functions that take several inputs (or one input in the form of a vector or some other complex structure). A gradient of a function is a vector of partial derivatives. You can look at finding a partial derivative of a function as the process of finding the derivative by focusing on one of the function’s inputs and by considering all other inputs as constant values.

例如,如果我们的函数定义为f([x^{(1)},x^{(2)}]) = ax^{(1)} + bx^{(2)} + c,那么函数f关于x^{(1)}的偏导数(表示为\frac{\partial f}{\partial x^{(1)}})由下式给出,

For example, if our function is defined as f([x^{(1)},x^{(2)}]) = ax^{(1)} + bx^{(2)} + c, then the partial derivative of the function f with respect to x^{(1)}, denoted as \frac{\partial f}{\partial x^{(1)}}, is given by,

\frac{\partial f}{\partial x^{(1)}} = a + 0 + 0 = a,

其中a是函数ax^{(1)}的导数;两个零分别是bx^{(2)}和c的导数,因为当我们对x^{(1)}求导时,x^{(2)}被视为常数,而任何常数的导数都为零。

where a is the derivative of the function ax^{(1)}; the two zeroes are respectively the derivatives of bx^{(2)} and c, because x^{(2)} is considered constant when we compute the derivative with respect to x^{(1)}, and the derivative of any constant is zero.

类似地,函数f关于x^{(2)}的偏导数\frac{\partial f}{\partial x^{(2)}}由下式给出,

Similarly, the partial derivative of the function f with respect to x^{(2)}, \frac{\partial f}{\partial x^{(2)}}, is given by,

\frac{\partial f}{\partial x^{(2)}} = 0 + b + 0 = b.

函数f的梯度,表示为\nabla f,由向量\left[\frac{\partial f}{\partial x^{(1)}}, \frac{\partial f}{\partial x^{(2)}}\right]给出。

The gradient of the function f, denoted as \nabla f, is given by the vector \left[\frac{\partial f}{\partial x^{(1)}}, \frac{\partial f}{\partial x^{(2)}}\right].
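The same finite-difference idea, applied to one input at a time while holding the other fixed, recovers the partial derivatives above; a sketch with assumed values for a, b, c and an arbitrary evaluation point:

```python
a, b, c = 2.0, -3.0, 5.0

def f(x1, x2):
    return a * x1 + b * x2 + c

# Partial derivatives: vary one input, hold the other constant.
h = 1e-6
x1, x2 = 1.0, 4.0
df_dx1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
df_dx2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
gradient = [df_dx1, df_dx2]  # should be close to [a, b]
```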

正如我在第 4 章中所说明的,链式法则也适用于偏导数。

The chain rule works with partial derivatives too, as I illustrate in Chapter 4.

2.2随机变量

2.2 Random Variable

随机变量,通常写作斜体大写字母,例如X,是一个变量,其可能取值是随机现象的数值结果。具有数值结果的随机现象的例子包括抛硬币(正面为0,反面为1)、掷骰子,或你在外面遇到的第一个陌生人的身高。随机变量有两种类型:离散的和连续的。

A random variable, usually written as an italic capital letter, like X, is a variable whose possible values are numerical outcomes of a random phenomenon. Examples of random phenomena with a numerical outcome include a toss of a coin (0 for heads and 1 for tails), a roll of a die, or the height of the first stranger you meet outside. There are two types of random variables: discrete and continuous.

离散随机变量只取可数个不同的值,例如red、yellow、blue或者1, 2, 3, \ldots。

A discrete random variable takes on only a countable number of distinct values, such as red, yellow, blue or 1, 2, 3, \ldots.

离散随机变量的概率分布由与其每个可能值相关联的概率列表来描述。该概率列表称为概率质量函数(pmf)。例如:\Pr(X = red) = 0.3,\Pr(X = yellow) = 0.45,\Pr(X = blue) = 0.25。概率质量函数中的每个概率都是大于或等于0的值,且概率之和等于1(图5)。

The probability distribution of a discrete random variable is described by a list of probabilities associated with each of its possible values. This list of probabilities is called a probability mass function (pmf). For example: \Pr(X = red) = 0.3, \Pr(X = yellow) = 0.45, \Pr(X = blue) = 0.25. Each probability in a probability mass function is a value greater than or equal to 0. The sum of probabilities equals 1 (fig. 5).

图 5:概率质量函数。
图 6:概率密度函数。

连续随机变量(CRV)在某个区间内取无限多个可能的值,例子包括身高、体重和时间。因为连续随机变量X的取值个数是无穷的,对任何c,概率\Pr(X = c)都为0。因此,CRV的概率分布(连续概率分布)不是用概率列表描述,而是用概率密度函数(pdf)描述。pdf是一个函数,其到达域非负,且曲线下的面积等于1(图6)。

A continuous random variable (CRV) takes an infinite number of possible values in some interval. Examples include height, weight, and time. Because the number of values of a continuous random variable X is infinite, the probability \Pr(X = c) for any c is 0. Therefore, instead of a list of probabilities, the probability distribution of a CRV (a continuous probability distribution) is described by a probability density function (pdf). The pdf is a function whose codomain is nonnegative and the area under whose curve is equal to 1 (fig. 6).

设离散随机变量X有k个可能的值\{x_i\}_{i=1}^k。X的期望,表示为\mathbb{E}[X],由下式给出,

Let a discrete random variable X have k possible values \{x_i\}_{i=1}^k. The expectation of X, denoted as \mathbb{E}[X], is given by,

\begin{aligned} \mathbb{E}[X] &\stackrel{\text{def}}{=} \sum_{i=1}^{k}\left[x_{i}\cdot\Pr(X = x_i)\right] \\ &= x_{1}\cdot\Pr(X = x_1)\\ &\,\,\,\,\,\,\, + x_{2}\cdot\Pr(X = x_2)+\cdots\\ &\,\,\,\,\,\,\, + x_{k}\cdot\Pr(X = x_k), \end{aligned} \qquad(1)

其中\Pr(X = x_i)是根据pmf,X取值x_i的概率。随机变量的期望也称为均值、平均值或期望值,常用字母\mu表示。期望是随机变量最重要的统计量之一。

where \Pr(X = x_i) is the probability that X has the value x_i according to the pmf. The expectation of a random variable is also called the mean, average or expected value and is frequently denoted with the letter \mu. The expectation is one of the most important statistics of a random variable.

另一个重要的统计数据是标准差,定义为:

Another important statistic is the standard deviation, defined as,

\sigma \stackrel{\text{def}}{=} \sqrt{\mathbb{E} [(X-\mu)^2]}.

方差,表示为\sigma^2或者var(X),定义为,

Variance, denoted as \sigma^2 or var(X), is defined as,

\sigma^2 = \mathbb{E} [(X-\mu)^2].

对于离散随机变量,标准差由下式给出:

For a discrete random variable, the standard deviation is given by:

\sigma = \sqrt{\begin{aligned} ~& \Pr(X = x_1)(x_{1}-\mu )^{2}\\ & +\Pr(X = x_2)(x_{2}-\mu )^{2}+\cdots\\ & + \Pr(X = x_k)(x_{k}-\mu )^{2}, \end{aligned}}

其中\mu = \mathbb{E}[X]。

where \mu = \mathbb{E}[X].
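For a discrete random variable, both statistics are short weighted sums; a sketch with a made-up pmf:

```python
import math

# A made-up pmf: Pr(X = x) for each possible value x.
values = [1.0, 2.0, 4.0]
probs = [0.2, 0.5, 0.3]

mu = sum(x * p for x, p in zip(values, probs))  # expectation E[X]
sigma = math.sqrt(sum(p * (x - mu) ** 2 for x, p in zip(values, probs)))
```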

连续随机变量X的期望由下式给出,

The expectation of a continuous random variable X is given by,

\mathbb{E}[X] \stackrel{\text{def}}{=} \int_{\mathbb{R}} xf_X(x)\,dx, \qquad(2)

其中f_X是变量X的pdf,\int_{\mathbb{R}}是函数xf_X的积分。

where f_X is the pdf of the variable X and \int_{\mathbb{R}} is the integral of the function xf_X.

当函数具有连续定义域时,积分相当于对函数所有值求和,它等于函数曲线下的面积。pdf曲线下面积为1这一性质,在数学上意味着\int_{\mathbb{R}} f_X(x)\,dx = 1。

The integral is an equivalent of the summation over all values of the function when the function has a continuous domain. It equals the area under the curve of the function. The property of the pdf that the area under its curve is 1 mathematically means that \int_{\mathbb{R}} f_X(x)\,dx = 1.

大多数时候我们不知道f_X,但我们可以观察到X的一些值。在机器学习中,我们把这些值称为示例,这些示例的集合称为样本或数据集。

Most of the time we don't know f_X, but we can observe some values of X. In machine learning, we call these values examples, and the collection of these examples is called a sample or a dataset.

2.3无偏估计量

2.3 Unbiased Estimators

因为f_X通常是未知的,而我们有一个样本S_X = \{x_i\}_{i=1}^N,所以我们常常不满足于求概率分布统计量(例如期望)的真实值,而是求它们的无偏估计量。

Because f_X is usually unknown, but we have a sample S_X = \{x_i\}_{i=1}^N, we often content ourselves not with the true values of statistics of the probability distribution, such as expectation, but with their unbiased estimators.

我们说\hat{\theta}(S_X)是某个统计量\theta的无偏估计量(用取自某个未知概率分布的样本S_X计算),如果\hat{\theta}(S_X)具有以下性质:

We say that \hat{\theta}(S_X) is an unbiased estimator of some statistic \theta calculated using a sample S_X drawn from an unknown probability distribution if \hat{\theta}(S_X) has the following property:

\mathbb{E}\left[\hat{\theta}(S_X)\right] = \theta,

其中\hat{\theta}是样本统计量,用样本S_X获得,而不是只有知道X才能获得的真实统计量\theta;期望是对取自X的所有可能样本求的。直观地说,这意味着如果你能拥有无限多个像S_X这样的样本,并用每个样本计算某个无偏估计量(例如\hat{\mu}),那么所有这些\hat{\mu}的平均值就等于在X上计算得到的真实统计量\mu。

where \hat{\theta} is a sample statistic, obtained using a sample S_X and not the real statistic \theta that can be obtained only knowing X; the expectation is taken over all possible samples drawn from X. Intuitively, this means that if you can have an unlimited number of such samples as S_X, and you compute some unbiased estimator, such as \hat{\mu}, using each sample, then the average of all these \hat{\mu} equals the real statistic \mu that you would get computed on X.

可以证明,未知量\mathbb{E}[X](由等式1或等式2给出)的一个无偏估计量是\frac{1}{N}\sum_{i=1}^N x_i(在统计学中称为样本均值)。

It can be shown that an unbiased estimator of an unknown \mathbb{E}[X] (given by either eq. 1 or eq. 2) is given by \frac{1}{N}\sum_{i=1}^N x_i (called in statistics the sample mean).
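The unbiasedness of the sample mean can be illustrated by simulation: draw many samples, compute each sample mean, and average them; a sketch using a fair six-sided die, whose true expectation is 3.5:

```python
import random

random.seed(0)  # for reproducibility

def sample_mean(n):
    # Draw a sample of n die rolls and return its sample mean.
    rolls = [random.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n

# Average the estimator over many independent samples.
means = [sample_mean(50) for _ in range(2000)]
average_of_means = sum(means) / len(means)  # close to E[X] = 3.5
```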

2.4贝叶斯法则

2.4 Bayes’ Rule

条件概率\Pr(X = x|Y = y)是在另一个随机变量Y取特定值y的条件下,随机变量X取特定值x的概率。贝叶斯法则(也称为贝叶斯定理)规定:

The conditional probability \Pr(X = x|Y = y) is the probability of the random variable X to have a specific value x given that another random variable Y has a specific value of y. Bayes' Rule (also known as Bayes' Theorem) stipulates that:

\begin{split}\Pr(X = x|Y = y) = \\ \frac{\Pr(Y = y|X = x)\Pr(X = x)}{\Pr(Y = y)}.\end{split}
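A classic numeric illustration of the rule (the probabilities are invented for the example): a rare condition and an imperfect test.

```python
# Hypothetical numbers: X = 1 means "has the condition",
# Y = 1 means "the test is positive".
p_x = 0.01                 # Pr(X = 1)
p_y_given_x = 0.99         # Pr(Y = 1 | X = 1)
p_y_given_not_x = 0.05     # Pr(Y = 1 | X = 0)

# Denominator Pr(Y = 1) via total probability.
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Bayes' Rule: Pr(X = 1 | Y = 1).
p_x_given_y = p_y_given_x * p_x / p_y
```

Despite the accurate test, the posterior is only about 1/6, because the condition itself is rare.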

2.5参数估计

2.5 Parameter Estimation

当我们有X分布的一个模型,并且这个模型f_{\bm{\theta}}是一个参数为向量\bm{\theta}的函数时,贝叶斯法则就会派上用场。这种函数的一个例子是高斯函数,它有两个参数\mu和\sigma,定义为:

Bayes' Rule comes in handy when we have a model of X's distribution, and this model f_{\bm{\theta}} is a function that has some parameters in the form of a vector \bm{\theta}. An example of such a function could be the Gaussian function that has two parameters, \mu and \sigma, and is defined as:

f_{\bm{\theta}}(x)={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu)^{2}}{2\sigma^{2}}}}, \qquad(3)

其中\bm{\theta} \stackrel{\text{def}}{=} [\mu,\sigma],\pi是常数3.14159\ldots。

where \bm{\theta} \stackrel{\text{def}}{=} [\mu,\sigma] and \pi is the constant 3.14159\ldots.
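Equation 3 translated directly into code; a brief sketch checking two basic properties of the density (symmetry around the mean, and the known peak value 1/sqrt(2*pi) for sigma = 1):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # f_theta(x) from eq. 3, with theta = [mu, sigma].
    return (1.0 / math.sqrt(2 * math.pi * sigma ** 2)
            * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)))

peak = gaussian_pdf(0.0, 0.0, 1.0)              # density at the mean
symmetric = (gaussian_pdf(1.0, 0.0, 1.0),
             gaussian_pdf(-1.0, 0.0, 1.0))      # equal by symmetry
```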

该函数具有pdf的所有性质1。因此,我们可以把它用作X的未知分布的模型。我们可以使用贝叶斯法则,根据数据更新向量\bm{\theta}中的参数值:

This function has all the properties of a pdf1. Therefore, we can use it as a model of an unknown distribution of X. We can update the values of the parameters in the vector \bm{\theta} from the data using Bayes' Rule:

普罗𝛉=𝛉̂|X=X普罗X=X|𝛉=𝛉̂普罗𝛉=𝛉̂普罗X=X=普罗X=X|𝛉=𝛉̂普罗𝛉=𝛉̂Σ𝛉̃普罗X=X|𝛉=𝛉̃普罗𝛉=𝛉̃4 \begin{split}\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}}|X = x) \leftarrow \\ \frac{\Pr(X = x|\mathbf{\theta} = \hat{\mathbf{\theta}})\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}})}{\Pr(X = x)} \\ = \frac{\Pr(X = x|\mathbf{\theta} = \hat{\mathbf{\theta}})\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}})}{\sum_{\tilde{\mathbf{\theta}}}\Pr(X = x|\mathbf{\theta} = \tilde{\mathbf{\theta}})\Pr(\mathbf{\theta} = \tilde{\mathbf{\theta}})}.\end{split} \qquad(4)

Pr(𝛉=𝛉̂|X=x)Pr(X=x|𝛉=𝛉̂)Pr(𝛉=𝛉̂)Pr(X=x)=Pr(X=x|𝛉=𝛉̂)Pr(𝛉=𝛉̂)𝛉̃Pr(X=x|𝛉=𝛉̃)Pr(𝛉=𝛉̃).(4) \begin{split}\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}}|X = x) \leftarrow \\ \frac{\Pr(X = x|\mathbf{\theta} = \hat{\mathbf{\theta}})\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}})}{\Pr(X = x)} \\ = \frac{\Pr(X = x|\mathbf{\theta} = \hat{\mathbf{\theta}})\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}})}{\sum_{\tilde{\mathbf{\theta}}}\Pr(X = x|\mathbf{\theta} = \tilde{\mathbf{\theta}})\Pr(\mathbf{\theta} = \tilde{\mathbf{\theta}})}.\end{split} \qquad(4)

在哪里普罗X=X|𝛉=𝛉̂=定义F𝛉̂\Pr(X = x|\mathbf{\theta} = \hat{\mathbf{\theta}}) \stackrel{\text{def}}{=} f_{\hat{\mathbf{\theta}}}

where Pr(X=x|𝛉=𝛉̂)=deff𝛉̂\Pr(X = x|\mathbf{\theta} = \hat{\mathbf{\theta}}) \stackrel{\text{def}}{=} f_{\hat{\mathbf{\theta}}}.

If we have a sample $\mathcal{S}$ of $X$ and the set of possible values for $\mathbf{\theta}$ is finite, we can easily estimate $\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}})$ by applying Bayes’ Rule iteratively, one example $x \in \mathcal{S}$ at a time. The initial value of $\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}})$ can be guessed such that $\sum_{\hat{\mathbf{\theta}}} \Pr(\mathbf{\theta} = \hat{\mathbf{\theta}}) = 1$. This guess of the probabilities for different $\hat{\mathbf{\theta}}$ is called the prior.

First, we compute $\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}}|X = x_1)$ for all possible values $\hat{\mathbf{\theta}}$. Then, before updating $\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}}|X = x)$ once again, this time for $x = x_2 \in \mathcal{S}$ using eq. 4, we replace the prior $\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}})$ in eq. 4 by the new estimate $\Pr(\mathbf{\theta} = \hat{\mathbf{\theta}}) \leftarrow \frac{1}{N}\sum_{x \in \mathcal{S}} \Pr(\mathbf{\theta} = \hat{\mathbf{\theta}}|X = x)$.
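This iterative update is easy to sketch in code. Below is a minimal illustration, assuming a Gaussian model with known $\sigma$ and a small finite set of candidate values for $\mu$; the candidate values, the sample, and all variable names are made up for this example:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # The model f_theta(x) of eq. 3, used as Pr(X = x | theta = theta_hat)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def bayes_update(prior, mus, x, sigma=1.0):
    # One application of eq. 4 over a finite set of candidate parameter values
    unnormalized = [gaussian_pdf(x, mu, sigma) * p for mu, p in zip(mus, prior)]
    evidence = sum(unnormalized)          # the denominator of eq. 4
    return [u / evidence for u in unnormalized]

mus = [0.0, 1.0, 2.0, 3.0]                # finite set of candidate values of mu
posterior = [0.25, 0.25, 0.25, 0.25]      # the prior: guessed, sums to 1
sample = [1.9, 2.2, 2.1, 1.8]             # a made-up sample S of X

for x in sample:                          # one example at a time
    posterior = bayes_update(posterior, mus, x)

best_mu = max(zip(mus, posterior), key=lambda pair: pair[1])[0]
print(best_mu)  # the candidate closest to the sample mean wins: 2.0
```

After a few updates, the probability mass concentrates on the candidate value nearest the sample mean, which is what the update rule is designed to do.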

The best value of the parameters $\mathbf{\theta}^{*}$ given the sample is obtained using the principle of maximum a posteriori (or MAP):

$$\mathbf{\theta}^{*} = \underset{\mathbf{\theta}}{\arg\max} \prod_{i=1}^N \Pr(\mathbf{\theta} = \hat{\mathbf{\theta}}|X = x_i). \qquad(5)$$

If the set of possible values for $\mathbf{\theta}$ isn’t finite, then we need to optimize eq. 5 directly using a numerical optimization routine, such as gradient descent, which we consider in Chapter 4. Usually, we optimize the natural logarithm of the right-hand side expression in eq. 5, because the logarithm of a product becomes a sum of logarithms, and it’s easier for the machine to work with a sum than with a product2.
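The numerical-overflow concern is easy to demonstrate: a long product of small probabilities underflows to zero in floating point, while the equivalent sum of logarithms stays perfectly representable (the probabilities below are made up):

```python
import math

probs = [1e-5] * 100   # 100 hypothetical per-example probabilities

product = 1.0
for p in probs:
    product *= p       # 1e-500 is far below the smallest positive double

log_sum = sum(math.log(p) for p in probs)  # equivalent sum of logarithms

print(product)   # 0.0: the product underflowed
print(log_sum)   # about -1151.29: perfectly representable
```

Because $\ln$ is strictly increasing, the $\hat{\mathbf{\theta}}$ that maximizes the sum of logarithms is the same one that maximizes the product.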

2.6 Parameters vs. Hyperparameters

A hyperparameter is a property of a learning algorithm, usually (but not always) having a numerical value. That value influences the way the algorithm works. Hyperparameters aren’t learned by the algorithm itself from data. They have to be set by the data analyst before running the algorithm. I show how to do that in Chapter 5.

Parameters are variables that define the model learned by the learning algorithm. Parameters are directly modified by the learning algorithm based on the training data. The goal of learning is to find such values of parameters that make the model optimal in a certain sense.

2.7 Classification vs. Regression

Classification is a problem of automatically assigning a label to an unlabeled example. Spam detection is a famous example of classification.

In machine learning, the classification problem is solved by a classification learning algorithm that takes a collection of labeled examples as inputs and produces a model that can take an unlabeled example as input and either directly output a label or output a number that can be used by the analyst to deduce the label. An example of such a number is a probability.

In a classification problem, a label is a member of a finite set of classes. If the size of the set of classes is two (“sick”/“healthy”, “spam”/“not_spam”), we talk about binary classification (also called binomial in some sources). Multiclass classification (also called multinomial) is a classification problem with three or more classes3.

While some learning algorithms naturally allow for more than two classes, others are by nature binary classification algorithms. There are strategies that allow turning a binary classification learning algorithm into a multiclass one. I talk about one of them in Chapter 7.

Regression is a problem of predicting a real-valued label (often called a target) given an unlabeled example. Estimating house price valuation based on house features, such as area, the number of bedrooms, location, and so on, is a famous example of regression.

The regression problem is solved by a regression learning algorithm that takes a collection of labeled examples as inputs and produces a model that can take an unlabeled example as input and output a target.

2.8 Model-Based vs. Instance-Based Learning

Most supervised learning algorithms are model-based. We have already seen one such algorithm: SVM. Model-based learning algorithms use the training data to create a model that has parameters learned from the training data. In SVM, the two parameters we saw were $\mathbf{w}^*$ and $b^*$. Once the model is built, the training data can be discarded.

Instance-based learning algorithms use the whole dataset as the model. One instance-based algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification, to predict a label for an input example, the kNN algorithm looks at the close neighborhood of the input example in the space of feature vectors and outputs the label that it saw most often in this close neighborhood.
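A minimal kNN classifier, with Euclidean distance and a majority vote, can be sketched as follows; the toy dataset, the labels, and the choice $k=3$ are made up for illustration:

```python
from collections import Counter
import math

def knn_predict(examples, x, k=3):
    # examples: list of (feature_vector, label); x: an unlabeled feature vector
    by_distance = sorted(examples, key=lambda e: math.dist(e[0], x))
    labels = [label for _, label in by_distance[:k]]   # the k closest neighbors
    return Counter(labels).most_common(1)[0][0]        # the most frequent label

train = [((1.0, 1.0), "neg"), ((1.2, 0.8), "neg"),
         ((4.0, 4.2), "pos"), ((4.1, 3.9), "pos"), ((3.8, 4.0), "pos")]
print(knn_predict(train, (4.0, 4.0), k=3))  # "pos"
```

Note that the whole `train` list must be kept around at prediction time: the dataset itself is the model.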

2.9 Shallow vs. Deep Learning

A shallow learning algorithm learns the parameters of the model directly from the features of the training examples. Most supervised learning algorithms are shallow. The notorious exceptions are neural network learning algorithms, specifically those that build neural networks with more than one layer between input and output. Such neural networks are called deep neural networks. In deep neural network learning (or, simply, deep learning), contrary to shallow learning, most model parameters are learned not directly from the features of the training examples, but from the outputs of the preceding layers.

Don’t worry if you don’t understand what that means right now. We look at neural networks more closely in Chapter 6.


  1. In fact, eq. 3 defines the pdf of one of the probability distributions most frequently used in practice, called the Gaussian distribution or normal distribution and denoted as $\mathcal{N}(\mu,\sigma^{2})$.

  2. Multiplication of many numbers can give either a very small result or a very large one. It often results in the problem of numerical overflow when the machine cannot store such extreme numbers in memory.

  3. There’s still one label per example though.

3 Fundamental Algorithms

In this chapter, I describe five algorithms that are not only the best known but also either very effective on their own or used as building blocks for the most effective learning algorithms out there.

3.1 Linear Regression

Linear regression is a popular regression learning algorithm that learns a model which is a linear combination of features of the input example.

3.1.1 Problem Statement

We have a collection of labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, where $N$ is the size of the collection, $\mathbf{x}_i$ is the $D$-dimensional feature vector of example $i = 1,\ldots,N$, $y_i$ is a real-valued1 target and every feature $x_i^{(j)}$, $j=1,\ldots,D$, is also a real number.

We want to build a model $f_{\mathbf{w}, b}(\mathbf{x})$ as a linear combination of features of example $\mathbf{x}$: $$f_{\mathbf{w}, b}(\mathbf{x}) = \mathbf{w}\mathbf{x} + b, \qquad(6)$$ where $\mathbf{w}$ is a $D$-dimensional vector of parameters and $b$ is a real number. The notation $f_{\mathbf{w}, b}$ means that the model $f$ is parametrized by two values: $\mathbf{w}$ and $b$.

We will use the model to predict the unknown $y$ for a given $\mathbf{x}$ like this: $y \leftarrow f_{\mathbf{w}, b}(\mathbf{x})$. Two models parametrized by two different pairs $(\mathbf{w}, b)$ will likely produce two different predictions when applied to the same example. We want to find the optimal values $(\mathbf{w}^*, b^*)$. Obviously, the optimal values of the parameters define the model that makes the most accurate predictions.

You could have noticed that the form of our linear model in eq. 6 is very similar to the form of the SVM model. The only difference is the missing $\operatorname{sign}$ operator. The two models are indeed similar. However, the hyperplane in the SVM plays the role of the decision boundary: it’s used to separate two groups of examples from one another. As such, it has to be as far from each group as possible.

On the other hand, the hyperplane in linear regression is chosen to be as close to all training examples as possible.

Figure 7: Linear regression for one-dimensional examples.

You can see why this latter requirement is essential by looking at the illustration in fig. 7. It displays the regression line (in red) for one-dimensional examples (blue dots). We can use this line to predict the value of the target $y_{new}$ for a new unlabeled input example $x_{new}$. If our examples are $D$-dimensional feature vectors (for $D > 1$), the only difference with the one-dimensional case is that the regression model is not a line but a plane (for two dimensions) or a hyperplane (for $D > 2$).

Now you see why it’s essential to have the requirement that the regression hyperplane lie as close to the training examples as possible: if the red line in fig. 7 were far from the blue dots, the prediction $y_{new}$ would have fewer chances of being correct.

3.1.2 Solution

To satisfy this latter requirement, the optimization procedure we use to find the optimal values $\mathbf{w}^*$ and $b^*$ tries to minimize the following expression:

$$\frac{1}{N} \sum_{i=1 \ldots N} (f_{\mathbf{w}, b}(\mathbf{x}_i) - y_i)^2. \qquad(7)$$

In mathematics, the expression we minimize or maximize is called an objective function, or, simply, an objective. The expression $(f_{\mathbf{w}, b}(\mathbf{x}_i) - y_i)^2$ in the above objective is called the loss function. It’s a measure of penalty for the error the model makes on example $i$. This particular choice of loss function is called squared error loss. All model-based learning algorithms have a loss function, and what we do to find the best model is try to minimize the objective known as the cost function. In linear regression, the cost function is given by the average loss, also called the empirical risk. The average loss, or empirical risk, of a model is the average of all penalties obtained by applying the model to the training data.
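These definitions map directly to code. A sketch for one-dimensional inputs, with made-up data, computing the squared error loss of each example and the empirical risk of eq. 7:

```python
def predict(w, b, x):
    # The linear model f_{w,b}(x) = wx + b for one-dimensional x
    return w * x + b

def empirical_risk(w, b, examples):
    # The average squared error loss over the training data (eq. 7)
    return sum((predict(w, b, x) - y) ** 2 for x, y in examples) / len(examples)

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # made-up (x, y) pairs
print(empirical_risk(2.0, 0.0, data))        # small: w = 2, b = 0 fits well
```

A candidate $(\mathbf{w}, b)$ with a lower empirical risk fits the training data better, which is exactly what the optimization procedure searches for.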

Why is the loss in linear regression a quadratic function? Why couldn’t we get the absolute value of the difference between the true target $y_i$ and the predicted value $f(\mathbf{x}_i)$ and use that as a penalty? We could. Moreover, we also could use a cube instead of a square.

Now you probably start realizing how many seemingly arbitrary decisions are made when we design a machine learning algorithm: we decided to use the linear combination of features to predict the target. However, we could use a square or some other polynomial to combine the values of features. We could also use some other loss function that makes sense: the absolute difference between $f(\mathbf{x}_i)$ and $y_i$ makes sense, the cube of the difference too; the binary loss ($1$ when $f(\mathbf{x}_i)$ and $y_i$ are different and $0$ when they are the same) also makes sense, right?

If we made different decisions about the form of the model, the form of the loss function, and about the choice of the algorithm that minimizes the average loss to find the best values of parameters, we would end up inventing a different machine learning algorithm. Sounds easy, doesn’t it? However, do not rush to invent a new learning algorithm. The fact that it’s different doesn’t mean that it will work better in practice.

People invent new learning algorithms for one of the two main reasons:

  1. The new algorithm solves a specific practical problem better than the existing algorithms.
  2. The new algorithm has better theoretical guarantees on the quality of the model it produces.

One practical justification of the choice of the linear form for the model is that it’s simple. Why use a complex model when you can use a simple one? Another consideration is that linear models rarely overfit. Overfitting is the property of a model such that the model predicts the labels of the examples used during training very well but frequently makes errors when applied to examples that weren’t seen by the learning algorithm during training.

Figure 8: Overfitting.

An example of overfitting in regression is shown in fig. 8. The data used to build the red regression line is the same as in fig. 7. The difference is that this time, it is polynomial regression with a polynomial of degree $10$. The regression line predicts the targets of almost all training examples almost perfectly, but will likely make significant errors on new data, as you can see in fig. 7 for $x_{new}$. We talk more about overfitting and how to avoid it in Chapter 5.

Now you know why linear regression can be useful: it doesn’t overfit much. But what about the squared loss? Why did we decide that it should be squared? In 1805, the French mathematician Adrien-Marie Legendre, who first published the sum of squares method for gauging the quality of the model, stated that squaring the error before summing is convenient. Why did he say that? The absolute value is not convenient, because it doesn’t have a continuous derivative, which makes the function not smooth. Functions that are not smooth create unnecessary difficulties when employing linear algebra to find closed form solutions to optimization problems. Closed form solutions to finding an optimum of a function are simple algebraic expressions and are often preferable to using complex numerical optimization methods, such as gradient descent (used, among others, to train neural networks).

Intuitively, squared penalties are also advantageous because they exaggerate the difference between the true target and the predicted one according to the value of this difference. We might also use the powers 3 or 4, but their derivatives are more complicated to work with.

Finally, why do we care about the derivative of the average loss? If we can calculate the gradient of the function in eq. 7, we can then set this gradient to zero2 and find the solution to a system of equations that gives us the optimal values $\mathbf{w}^*$ and $b^*$.
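For one-dimensional inputs, solving that system of equations gives the textbook closed-form estimates $w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $b = \bar{y} - w\bar{x}$. A minimal sketch with made-up data:

```python
def fit_linear_regression(examples):
    # Closed-form solution obtained by setting the gradient of the average
    # squared error loss (eq. 7) to zero, for one-dimensional x
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in examples) / \
        sum((x - mean_x) ** 2 for x, _ in examples)
    b = mean_y - w * mean_x
    return w, b

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # lies exactly on y = 2x + 1
w, b = fit_linear_regression(data)
print(w, b)  # 2.0 1.0
```

No iterative optimization is needed here, which is precisely the convenience Legendre had in mind.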

3.2 Logistic Regression

The first thing to say is that logistic regression is not a regression, but a classification learning algorithm. The name comes from statistics and is due to the fact that the mathematical formulation of logistic regression is similar to that of linear regression.

I explain logistic regression using the case of binary classification. However, it can naturally be extended to multiclass classification.

3.2.1 Problem Statement

In logistic regression, we still want to model $y_i$ as a linear function of $\mathbf{x}_i$; however, with a binary $y_i$ this is not straightforward. The linear combination of features such as $\mathbf{w}\mathbf{x}_i + b$ is a function that spans from minus infinity to plus infinity, while $y_i$ has only two possible values.

At the time when the absence of computers required scientists to perform calculations manually, they were eager to find a linear classification model. They figured out that if we define a negative label as $0$ and the positive label as $1$, we would just need to find a simple continuous function whose codomain is $(0,1)$. In such a case, if the value returned by the model for input $\mathbf{x}$ is closer to $0$, then we assign a negative label to $\mathbf{x}$; otherwise, the example is labeled as positive. One function that has such a property is the standard logistic function (also known as the sigmoid function):

$$f(x)=\frac{1}{1+e^{-x}},$$

where $e$ is the base of the natural logarithm (also called Euler’s number; $e^x$ is also known as the exp(x) function in programming languages). Its graph is depicted in fig. 9.

The logistic regression model looks like this:

$$f_{\mathbf{w}, b}(\mathbf{x}) \stackrel{\text{def}}{=} \frac{1}{1+e^{-(\mathbf{w}\mathbf{x} + b)}}. \qquad(8)$$

You can see the familiar term $\mathbf{w}\mathbf{x} + b$ from linear regression.
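Eq. 8 translates directly into code; in the sketch below the weights are made-up values chosen for illustration, not learned ones:

```python
import math

def logistic_model(w, b, x):
    # f_{w,b}(x) = 1 / (1 + exp(-(wx + b))) from eq. 8
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

w, b = [1.5, -0.5], -1.0                   # hypothetical parameter values
print(logistic_model(w, b, [2.0, 1.0]))    # well above 0.5 -> positive class
print(logistic_model(w, b, [0.0, 2.0]))    # well below 0.5 -> negative class
```

Whatever the input, the output always lands strictly between $0$ and $1$, which is what makes the sigmoid usable as a probability.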

By looking at the graph of the standard logistic function, we can see how well it fits our classification purpose: if we optimize the values of $\mathbf{w}$ and $b$ appropriately, we could interpret the output of $f(\mathbf{x})$ as the probability of $y_i$ being positive. For example, if it’s higher than or equal to the threshold $0.5$, we would say that the class of $\mathbf{x}$ is positive; otherwise, it’s negative. In practice, the choice of the threshold could be different depending on the problem. We return to this discussion in Chapter 5 when we talk about model performance assessment.

Now, how do we find optimal $\mathbf{w}^*$ and $b^*$? In linear regression, we minimized the empirical risk, which was defined as the average squared error loss, also known as the mean squared error or MSE.

Figure 9: The standard logistic function.

3.2.2 Solution

In logistic regression, on the other hand, we maximize the likelihood of our training set according to the model. In statistics, the likelihood function defines how likely the observation (an example) is according to our model.

For instance, let’s have a labeled example $(\mathbf{x}_i, y_i)$ in our training data. Assume also that we found (guessed) some specific values $\mathbf{\hat{w}}$ and $\hat{b}$ of our parameters. If we now apply our model $f_{\mathbf{\hat{w}},\hat{b}}$ to $\mathbf{x}_i$ using eq. 8, we will get some value $0 < p < 1$ as output. If $y_i$ is the positive class, the likelihood of $y_i$ being the positive class, according to our model, is given by $p$. Similarly, if $y_i$ is the negative class, the likelihood of it being the negative class is given by $1-p$.

The optimization criterion in logistic regression is called maximum likelihood. Instead of minimizing the average loss, like in linear regression, we now maximize the likelihood of the training data according to our model:

$$L_{\mathbf{w}, b} \stackrel{\text{def}}{=} \prod_{i=1 \ldots N} f_{\mathbf{w}, b}(\mathbf{x}_i)^{y_i}(1 - f_{\mathbf{w}, b}(\mathbf{x}_i))^{(1 - y_i)}. \qquad(9)$$

The expression $f_{\mathbf{w}, b}(\mathbf{x})^{y_i}(1 - f_{\mathbf{w}, b}(\mathbf{x}))^{(1 - y_i)}$ may look scary, but it’s just a fancy mathematical way of saying: “$f_{\mathbf{w}, b}(\mathbf{x})$ when $y_i = 1$ and $(1 - f_{\mathbf{w}, b}(\mathbf{x}))$ otherwise”. Indeed, if $y_i = 1$, then $(1 - f_{\mathbf{w}, b}(\mathbf{x}))^{(1 - y_i)}$ equals $1$ because $(1 - y_i) = 0$, and we know that anything to the power of $0$ equals $1$. On the other hand, if $y_i = 0$, then $f_{\mathbf{w}, b}(\mathbf{x})^{y_i}$ equals $1$ for the same reason.

You may have noticed that we used the product operator $\prod$ in the objective function instead of the sum operator $\sum$, which was used in linear regression. This is because the likelihood of observing $N$ labels for $N$ examples is the product of the likelihoods of each observation (assuming that all observations are independent of one another, which is the case). You can draw a parallel with the multiplication of probabilities of outcomes in a series of independent experiments in probability theory.

Because of the $\exp$ function used in the model, in practice it’s more convenient, to avoid numerical overflow, to maximize the log-likelihood instead of the likelihood. The log-likelihood is defined as follows:

$$LogL_{\mathbf{w}, b} \stackrel{\text{def}}{=} \ln L_{\mathbf{w}, b} = \sum_{i = 1}^N \big[y_i \ln{f_{\mathbf{w}, b}(\mathbf{x}_i)} + (1 - y_i)\ln{(1 - f_{\mathbf{w}, b}(\mathbf{x}_i))}\big].$$

Because $\ln$ is a strictly increasing function, maximizing this function is the same as maximizing its argument, and the solution to this new optimization problem is the same as the solution to the original problem.

Contrary to linear regression, there’s no closed form solution to the above optimization problem. A typical numerical optimization procedure used in such cases is gradient descent. We talk about it in the next chapter.
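A sketch of such a procedure for one-dimensional inputs: full-batch gradient ascent on the log-likelihood, using the standard gradients $\frac{\partial LogL}{\partial w} = \sum_i (y_i - f_{w,b}(x_i))x_i$ and $\frac{\partial LogL}{\partial b} = \sum_i (y_i - f_{w,b}(x_i))$. The toy data, learning rate, and epoch count are all made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(examples, lr=0.1, epochs=2000):
    # Maximize the log-likelihood by gradient ascent (no closed form exists)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = sum((y - sigmoid(w * x + b)) * x for x, y in examples)
        grad_b = sum(y - sigmoid(w * x + b) for x, y in examples)
        w += lr * grad_w   # ascent: step in the direction of the gradient
        b += lr * grad_b
    return w, b

# Made-up data: negatives cluster near 0, positives near 3
data = [(0.2, 0), (0.5, 0), (0.9, 0), (2.6, 1), (3.1, 1), (3.4, 1)]
w, b = fit_logistic_regression(data)
print(sigmoid(w * 0.3 + b))  # close to 0
print(sigmoid(w * 3.0 + b))  # close to 1
```

Since the log-likelihood is concave in $(\mathbf{w}, b)$, gradient ascent with a small enough step reliably approaches the maximum.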

3.3 Decision Tree Learning

A decision tree is an acyclic graph that can be used to make decisions. In each branching node of the graph, a specific feature $j$ of the feature vector is examined. If the value of the feature is below a specific threshold, then the left branch is followed; otherwise, the right branch is followed. When a leaf node is reached, a decision is made about the class to which the example belongs.

As the title of the section suggests, a decision tree can be learned from data.

3.3.1 Problem Statement

As before, we have a collection of labeled examples; labels belong to the set $\{0,1\}$. We want to build a decision tree that would allow us to predict the class given a feature vector.

3.3.2 Solution

There are various formulations of the decision tree learning algorithm. In this book, we consider just one, called ID3.

The optimization criterion, in this case, is the average log-likelihood:

$$\frac{1}{N} \sum_{i=1}^N \Big[y_i \ln{f_{ID3}(\mathbf{x}_i)} + (1 - y_i)\ln{(1 - f_{ID3}(\mathbf{x}_i))}\Big], \qquad(10)$$

where $f_{ID3}$ is a decision tree.

By now, it looks very similar to logistic regression. However, contrary to the logistic regression learning algorithm, which builds a parametric model $f_{\mathbf{w}^*, b^*}$ by finding an optimal solution to the optimization criterion, the ID3 algorithm optimizes it approximately by constructing a nonparametric model $f_{ID3}(\mathbf{x}) \stackrel{\text{def}}{=} \Pr(y=1|\mathbf{x})$.

Figure 10: An illustration of the decision tree building algorithm. In the beginning, the decision tree contains only the start node; it makes the same prediction for any input.
Figure 11: An illustration of the decision tree building algorithm. The decision tree after the first split; it tests whether feature $3$ is less than $18.3$ and, depending on the result, the prediction is made in one of the two leaf nodes.

The ID3 learning algorithm works as follows. Let $\mathcal{S}$ denote a set of labeled examples. In the beginning, the decision tree only has a start node that contains all examples: $\mathcal{S} \stackrel{\text{def}}{=} \{(\mathbf{x}_i, y_i)\}_{i=1}^N$. Start with a constant model $f_{ID3}^{\mathcal{S}}$ defined as,


fID3𝒮=def1|𝒮|(𝐱,y)𝒮y.(11) f_{ID3}^{\mathcal{S}} \stackrel{\text{def}}{=} \frac{1}{|\mathcal{S}|} \sum_{(\mathbf{x}, y) \in \mathcal{S}} y. \qquad(11)

上述模型给出的预测 f^S_{ID3}(\mathbf{x}) 对于任何输入 \mathbf{x} 都相同。使用包含 12 个带标签示例的玩具数据集构建的相应决策树如图 10 所示。

The prediction given by the above model, fID3S(𝐱)f^S_{ID3}(\mathbf{x}), would be the same for any input 𝐱\mathbf{x}. The corresponding decision tree built using a toy dataset of 1212 labeled examples is shown in fig. 10.

然后我们搜索所有特征 j = 1,\ldots,D 和所有阈值 t,并将集合 \mathcal{S} 分为两个子集:\mathcal{S}_{-} \stackrel{\text{def}}{=} \{(\mathbf{x}, y)\,|\,(\mathbf{x}, y) \in \mathcal{S}, x^{(j)} < t\} 和 \mathcal{S}_{+} \stackrel{\text{def}}{=} \{(\mathbf{x}, y)\,|\,(\mathbf{x}, y) \in \mathcal{S}, x^{(j)} \geq t\}。两个新的子集将进入两个新的叶节点,并且我们对所有可能的对 (j, t) 评估分成 \mathcal{S}_{-} 和 \mathcal{S}_{+} 两部分的分裂有多好。最后,我们选择最好的一组值 (j, t),将 \mathcal{S} 分为 \mathcal{S}_{+} 和 \mathcal{S}_{-},形成两个新的叶节点,并在 \mathcal{S}_{+} 和 \mathcal{S}_{-} 上继续递归(或者如果没有分裂产生比当前模型足够好的模型则停止)。一次分裂后的决策树如图 11 所示。

Then we search through all features j=1,,Dj = 1,\ldots,D and all thresholds tt, and split the set SS into two subsets: 𝒮=def{(𝐱,y)|(𝐱,y)𝒮,x(j)<t}\mathcal{S}_{-} \stackrel{\text{def}}{=} \{(\mathbf{x}, y)\, |\, (\mathbf{x}, y) \in \mathcal{S}, x^{(j)} < t\} and 𝒮+=def{(𝐱,y)|(𝐱,y)𝑆,x(j)t}\mathcal{S}_{+} \stackrel{\text{def}}{=} \{(\mathbf{x}, y)\, |\, (\mathbf{x}, y) \in \mathit{S}, x^{(j)} \geq t\}. The two new subsets would go to two new leaf nodes, and we evaluate, for all possible pairs (j,t)(j, t) how good the split with pieces 𝒮\mathcal{S}_{-} and 𝒮+\mathcal{S}_{+} is. Finally, we pick the best such values (j,t)(j, t), split 𝒮\mathcal{S} into 𝒮+\mathcal{S}_{+} and 𝒮\mathcal{S}_{-}, form two new leaf nodes, and continue recursively on 𝒮+\mathcal{S}_{+} and 𝒮\mathcal{S}_{-} (or quit if no split produces a model that’s sufficiently better than the current one). A decision tree after one split is illustrated in fig. 11.

现在你应该想知道“评估分裂有多好”这句话是什么意思。在 ID3 中,分裂的优劣是通过使用称为熵的标准来估计的。熵是随机变量不确定性的度量。当随机变量的所有值等概率时,它达到最大值。当随机变量只能取一个值时,熵达到最小值。一组示例 \mathcal{S} 的熵由下式给出:

Now you should wonder what the words “evaluate how good the split is” mean. In ID3, the goodness of a split is estimated by using the criterion called entropy. Entropy is a measure of uncertainty about a random variable. It reaches its maximum when all values of the random variable are equiprobable. Entropy reaches its minimum when the random variable can have only one value. The entropy of a set of examples \mathcal{S} is given by,


H(𝒮)=deffID3𝒮lnfID3𝒮(1fID3𝒮)ln(1fID3𝒮).(12) \begin{aligned}H(\mathcal{S}) &\stackrel{\text{def}}{=} - f^{\mathcal{S}}_{ID3} \ln f^{\mathcal{S}}_{ID3} \\ &\,\,\,\,\,\,\,\,- (1-f^{\mathcal{S}}_{ID3}) \ln(1-f^{\mathcal{S}}_{ID3}).\end{aligned}\qquad(12)

当我们按某个特征 j 和阈值 t 拆分一组示例时,分裂的熵 H(\mathcal{S}_{-}, \mathcal{S}_{+}) 只是两个熵的加权和:

When we split a set of examples by a certain feature jj and a threshold tt, the entropy of a split, H(𝒮,𝒮+)H(\mathcal{S}_{-}, \mathcal{S}_{+}), is simply a weighted sum of two entropies:


H(𝒮,𝒮+)=def|𝒮||𝒮|H(𝒮)+|𝒮+||𝒮|H(𝒮+).(13) \begin{aligned}H(\mathcal{S}_{-}, \mathcal{S}_{+}) &\stackrel{\text{def}}{=} \frac{|\mathcal{S}_{-}|}{|\mathcal{S}|} H(\mathcal{S}_{-}) \\ &\,\,\,\,\,\,+ \frac{|\mathcal{S}_{+}|}{|\mathcal{S}|} H(\mathcal{S}_{+}).\end{aligned} \qquad(13)

因此,在 ID3 中,在每一步,在每个叶节点,我们要么找到一个使式 13 给出的熵最小化的分裂,要么停在这个叶节点。

So, in ID3, at each step, at each leaf node, we find a split that minimizes the entropy given by eq. 13 or we stop at this leaf node.
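To make the split search concrete, here is a minimal sketch in Python of the entropy of eq. 12 and the exhaustive search over pairs (j, t) that minimizes eq. 13. The function names and data layout are illustrative, not from the book:

```python
import math

def entropy(labels):
    """Binary entropy of a list of 0/1 labels, in nats (eq. 12)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0  # a pure set has zero uncertainty
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def best_split(examples):
    """Search all features j and thresholds t for the split minimizing
    the weighted entropy of eq. 13. `examples` is a list of (x, y)
    with x a feature vector and y in {0, 1}."""
    best = None  # (weighted_entropy, j, t)
    D = len(examples[0][0])
    for j in range(D):
        for t in sorted({x[j] for x, _ in examples}):
            left = [y for x, y in examples if x[j] < t]
            right = [y for x, y in examples if x[j] >= t]
            if not left or not right:
                continue  # a split must produce two non-empty leaves
            h = (len(left) / len(examples)) * entropy(left) \
                + (len(right) / len(examples)) * entropy(right)
            if best is None or h < best[0]:
                best = (h, j, t)
    return best
```

In practice, candidate thresholds are taken from the feature values seen in the data, as above, since only those thresholds change how the set is partitioned.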

在以下任一情况下,算法将停止在叶节点:

The algorithm stops at a leaf node in any of the below situations:

  • 叶节点中的所有示例都被单一模型(式 11)正确分类。
  • All examples in the leaf node are classified correctly by the one-piece model (eq. 11).
  • 我们找不到要分割的属性。
  • We cannot find an attribute to split upon.
  • 分裂减少的熵小于某个 \epsilon(其值必须通过实验找到3)。
  • The split reduces the entropy less than some ϵ\epsilon (the value for which has to be found experimentally3).
  • 树达到某个最大深度 d(也必须通过实验确定)。
  • The tree reaches some maximum depth dd (also has to be found experimentally).

因为在 ID3 中,每次迭代时分割数据集的决定是局部的(不依赖于未来的分割),所以该算法不能保证最佳解决方案。可以通过在搜索最佳决策树期间使用回溯等技术来改进模型,但代价是可能需要更长的时间来构建模型。

Because in ID3, the decision to split the dataset on each iteration is local (doesn’t depend on future splits), the algorithm doesn’t guarantee an optimal solution. The model can be improved by using techniques like backtracking during the search for the optimal decision tree at the cost of possibly taking longer to build a model.

最广泛使用的决策树学习算法公式称为C4.5。与 ID3 相比,它有几个附加功能:

The most widely used formulation of a decision tree learning algorithm is called C4.5. It has several additional features as compared to ID3:

  • 它接受连续和离散特征;
  • it accepts both continuous and discrete features;
  • 它处理不完整的示例;
  • it handles incomplete examples;
  • 它通过使用称为“修剪”的自下而上技术来解决过度拟合问题。
  • it mitigates the overfitting problem by using a bottom-up technique known as “pruning”.

修剪包括在树创建后返回,并通过用叶节点替换那些对减少错误贡献不够显着的分支来删除它们。

Pruning consists of going back through the tree once it’s been created and replacing branches that don’t contribute significantly to error reduction with leaf nodes.

基于熵的分裂标准直观上是有意义的:当 \mathcal{S} 中所有示例具有相同标签时,熵达到最小值 0;另一方面,当 \mathcal{S} 中恰好有一半示例被标记为 1 时,熵达到最大值(对于式 12 中的自然对数为 \ln 2),使得这样的叶子对于分类毫无用处。唯一剩下的问题是该算法如何近似最大化平均对数似然标准。我将其留待进一步阅读。

The entropy-based split criterion intuitively makes sense: entropy reaches its minimum of 0 when all examples in \mathcal{S} have the same label; on the other hand, the entropy is at its maximum (\ln 2 for the natural logarithm of eq. 12) when exactly one-half of examples in \mathcal{S} is labeled with 1, making such a leaf useless for classification. The only remaining question is how this algorithm approximately maximizes the average log-likelihood criterion. I leave it for further reading.

3.4支持向量机

3.4 Support Vector Machine

我已经在简介中介绍了 SVM,因此本节仅填补几个空白。需要回答两个关键问题:

I already presented SVM in the introduction, so this section only fills a couple of blanks. Two critical questions need to be answered:

  1. 如果数据中存在噪声并且没有超平面可以完美区分正例和负例怎么办?
  1. What if there’s noise in the data and no hyperplane can perfectly separate positive examples from negative ones?
  2. 如果数据无法使用平面分离,但可以通过高阶多项式分离怎么办?
  2. What if the data cannot be separated using a plane, but could be separated by a higher-order polynomial?
图 12:线性不可分离的情况:存在噪声。
图 13:线性不可分离的情况:固有的非线性

您可以看到图 12 和图 13 中描述的两种情况。在左侧的情况下,如果没有噪声(异常值或带有错误标签的示例),数据可以用直线分隔。在右侧的情况下,决策边界是一个圆而不是一条直线。

You can see both situations depicted in fig. 12 and fig. 13. In the left case, the data could be separated by a straight line if not for the noise (outliers or examples with wrong labels). In the right case, the decision boundary is a circle and not a straight line.

请记住,在 SVM 中,我们希望满足以下约束:

Remember that in SVM, we want to satisfy the following constraints:


𝐰𝐱ib+1ifyi=+1,𝐰𝐱ib1ifyi=1.(14) \begin{aligned} &\mathbf{w}\mathbf{x}_i-b \geq +1 && \text{if}\ y_i = +1, \\ &\mathbf{w}\mathbf{x}_i-b \leq -1 && \text{if}\ y_i = -1. \end{aligned} \qquad(14)

我们还希望最小化 \|\mathbf{w}\|,使得超平面与每个类中最近的示例的距离相等。最小化 \|\mathbf{w}\| 相当于最小化 \frac{1}{2}||\mathbf{w}||^2,并且使用该项使得稍后可以执行二次规划优化。因此,SVM 的优化问题如下所示:

We also want to minimize 𝐰\|\mathbf{w}\| so that the hyperplane is equally distant from the closest examples of each class. Minimizing 𝐰\|\mathbf{w}\| is equivalent to minimizing 12||𝐰||2\frac{1}{2}||\mathbf{w}||^2, and the use of this term makes it possible to perform quadratic programming optimization later on. The optimization problem for SVM, therefore, looks like this:


min12||𝐰||2such that:yi(𝐱i𝐰b)10,i=1,,N.(15) \begin{split}\min \frac{1}{2}||\mathbf{w}||^2\,\,\,\textrm{such that:}\\ y_i(\mathbf{x}_i \mathbf{w} - b) - 1 \geq 0,\, i = 1,\ldots,N.\end{split} \qquad(15)

3.4.1处理噪声

3.4.1 Dealing with Noise

为了将 SVM 扩展到数据不可线性分离的情况,我们引入了铰链损失函数:\max\left(0, 1 - y_i(\mathbf{w}\mathbf{x}_i - b)\right)。

To extend SVM to cases in which the data is not linearly separable, we introduce the hinge loss function: max(0,1yi(𝐰𝐱ib))\max \left(0,1-y_i(\mathbf{w}\mathbf{x}_i-b)\right).

如果满足式 14 中的约束,铰链损失函数为零;换句话说,如果 \mathbf{w}\mathbf{x}_i 位于决策边界的正确一侧。对于决策边界错误一侧的数据,函数的值与距决策边界的距离成正比。

The hinge loss function is zero if the constraints in eq. 14 are satisfied; in other words, if 𝐰𝐱i\mathbf{w}\mathbf{x}_i lies on the correct side of the decision boundary. For data on the wrong side of the decision boundary, the function’s value is proportional to the distance from the decision boundary.

然后我们希望最小化以下成本函数,

We then wish to minimize the following cost function,


C𝐰2+1Ni=1Nmax(0,1yi(𝐰𝐱ib)), C\lVert \mathbf{w} \rVert^2 + {\frac {1}{N}}\sum_{i=1}^N \max \left(0,1-y_i(\mathbf{w}\mathbf{x}_i-b)\right),

其中超参数 C 决定在增大决策边界与确保每个 \mathbf{x}_i 位于决策边界正确一侧之间的权衡。C 的值通常通过实验选择,就像 ID3 的超参数 \epsilon 和 d 一样。优化铰链损失的 SVM 称为软间隔 SVM,而原始公式称为硬间隔 SVM。

where the hyperparameter CC determines the tradeoff between increasing the size of the decision boundary and ensuring that each 𝐱i\mathbf{x}_i lies on the correct side of the decision boundary. The value of CC is usually chosen experimentally, just like ID3’s hyperparameters ϵ\epsilon and dd. SVMs that optimize hinge loss are called soft-margin SVMs, while the original formulation is referred to as a hard-margin SVM.

正如您所看到的,对于足够大的 C 值,成本函数中的第二项将变得可以忽略不计,因此 SVM 算法将完全忽略错误分类,试图找到最大的间隔。当我们减小 C 的值时,分类错误的代价变得越来越高,因此 SVM 算法试图通过牺牲间隔大小来减少错误。正如我们已经讨论过的,更大的间隔更有利于泛化。所以,C 调节在对训练数据进行良好分类(最小化经验风险)与对未来示例进行良好分类(泛化)之间的权衡。

As you can see, for sufficiently high values of CC, the second term in the cost function will become negligible, so the SVM algorithm will try to find the highest margin by completely ignoring misclassification. As we decrease the value of CC, making classification errors becomes more costly, so the SVM algorithm tries to make fewer mistakes by sacrificing the margin size. As we have already discussed, a larger margin is better for generalization. Therefore, CC regulates the tradeoff between classifying the training data well (minimizing empirical risk) and classifying future examples well (generalization).
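To make the role of C concrete, here is a small sketch that evaluates the soft-margin objective above for given parameters; the function name and data layout are illustrative, not from the book:

```python
def svm_cost(w, b, X, y, C):
    """Soft-margin SVM objective: C*||w||^2 plus the average hinge loss.
    X is a list of feature vectors, y a list of labels in {-1, +1}."""
    norm_sq = sum(wj * wj for wj in w)
    total_hinge = 0.0
    for x_i, y_i in zip(X, y):
        # hinge loss is zero when the example is on the correct side
        # of the margin, i.e., when y_i * (w.x_i - b) >= 1
        margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) - b)
        total_hinge += max(0.0, 1.0 - margin)
    return C * norm_sq + total_hinge / len(X)
```

With a large C, the norm term dominates and misclassified examples barely change the objective; with a small C, the hinge term dominates and errors become expensive, exactly the tradeoff described above.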

3.4.2处理固有的非线性

3.4.2 Dealing with Inherent Non-Linearity

SVM 可以适用于处理原始空间中无法被超平面分隔的数据集。事实上,如果我们设法将原始空间转换为更高维度的空间,我们可以希望这些例子在这个转换后的空间中变得线性可分。在SVM中,在成本函数优化过程中使用函数将原始空间隐式变换到更高维空间称为核技巧

SVM can be adapted to work with datasets that cannot be separated by a hyperplane in its original space. Indeed, if we manage to transform the original space into a space of higher dimensionality, we could hope that the examples will become linearly separable in this transformed space. In SVMs, using a function to implicitly transform the original space into a higher dimensional space during the cost function optimization is called the kernel trick.

图 14:图 13 中的数据在变换到三维空间后变得线性可分。

应用核技巧的效果如图 14 所示。正如您所看到的,可以使用特定的映射 \phi: \mathbf{x} \mapsto \phi(\mathbf{x}) 将二维线性不可分数据转换为线性可分的三维数据,其中 \phi(\mathbf{x}) 是比 \mathbf{x} 维数更高的向量。以图 13 中的 2D 数据为例,将 2D 示例 \mathbf{x} = [q,p] 投影到 3D 空间(图 14)的映射 \phi 如下所示:\phi([q, p]) \stackrel{\text{def}}{=} (q^2, \sqrt{2} q p, p^2),其中 \cdot^2 表示 \cdot 的平方。您现在可以看到,数据在变换后的空间中变得线性可分。

The effect of applying the kernel trick is illustrated in fig. 14. As you can see, it’s possible to transform two-dimensional non-linearly-separable data into linearly-separable three-dimensional data using a specific mapping \phi: \mathbf{x} \mapsto \phi(\mathbf{x}), where \phi(\mathbf{x}) is a vector of higher dimensionality than \mathbf{x}. For the example of the 2D data in fig. 13, the mapping \phi that projects a 2D example \mathbf{x} = [q,p] into a 3D space (fig. 14) would look like this: \phi([q, p]) \stackrel{\text{def}}{=} (q^2, \sqrt{2} q p, p^2), where \cdot^2 means \cdot squared. You see now that the data becomes linearly separable in the transformed space.

然而,我们不知道先验哪个映射φ\phi对我们的数据有用。如果我们首先使用某种映射将所有输入示例转换为非常高维的向量,然后将 SVM 应用于该数据,并尝试所有可能的映射函数,则计算可能会变得非常低效,并且我们永远无法解决分类问题。

However, we don’t know a priori which mapping ϕ\phi would work for our data. If we first transform all our input examples using some mapping into very high dimensional vectors and then apply SVM to this data, and we try all possible mapping functions, the computation could become very inefficient, and we would never solve our classification problem.

幸运的是,科学家们弄清楚了如何使用核函数(或简称核)在高维空间中有效地工作,而无需显式地进行这种变换。要了解核的工作原理,我们必须首先了解 SVM 的优化算法如何找到 \mathbf{w} 和 b 的最优值。

Fortunately, scientists figured out how to use kernel functions (or, simply, kernels) to efficiently work in higher-dimensional spaces without doing this transformation explicitly. To understand how kernels work, we have to see first how the optimization algorithm for SVM finds the optimal values for 𝐰\mathbf{w} and bb.

传统上用于求解式 15 中优化问题的方法是拉格朗日乘子法。与其求解式 15 的原始问题,不如求解如下形式的等价问题:

The method traditionally used to solve the optimization problem in eq. 15 is the method of Lagrange multipliers. Instead of solving the original problem from eq. 15, it is convenient to solve an equivalent problem formulated like this:


maxα1αN[i=1Nαi12i=1Nk=1Nyiαi(𝐱i𝐱k)ykαk]subject to i=1Nαiyi=0and αi0,i=1,,N, \begin{split}\max_{\alpha_1 \ldots \alpha_N} \Big[ \sum_{i=1}^{N}\alpha_{i} \\ - {\frac {1}{2}}\sum_{i=1}^{N}\sum_{k=1}^{N}y_i \alpha_i(\mathbf{x}_i \mathbf{x}_k)y_k \alpha_k\Big]\\ \text{subject to } \sum_{i=1}^{N}\alpha_i y_i=0\\ \text{and } \alpha_i \geq 0, i = 1, \ldots, N,\end{split}

其中 \alpha_i 称为拉格朗日乘子。当这样表述时,优化问题就变成了凸二次优化问题,可以通过二次规划算法有效求解。

where αi\alpha_i are called Lagrange multipliers. When formulated like this, the optimization problem becomes a convex quadratic optimization problem, efficiently solvable by quadratic programming algorithms.

现在,您可能已经注意到,在上面的公式中,有一个项 \mathbf{x}_i \mathbf{x}_k,这是唯一使用特征向量的地方。如果我们想将向量空间变换到更高维空间,我们需要将 \mathbf{x}_i 变换为 \phi(\mathbf{x}_i),将 \mathbf{x}_k 变换为 \phi(\mathbf{x}_k),然后将 \phi(\mathbf{x}_i) 与 \phi(\mathbf{x}_k) 相乘。这样做的成本会非常高。

Now, you could have noticed that in the above formulation, there is a term 𝐱i𝐱k\mathbf{x}_i \mathbf{x}_k, and this is the only place where the feature vectors are used. If we want to transform our vector space into higher dimensional space, we need to transform 𝐱i\mathbf{x}_i into ϕ(𝐱i)\phi(\mathbf{x}_i) and 𝐱k\mathbf{x}_k into ϕ(𝐱k)\phi(\mathbf{x}_k) and then multiply ϕ(𝐱i)\phi(\mathbf{x}_i) and ϕ(𝐱k)\phi(\mathbf{x}_k). Doing so would be very costly.

另一方面,我们只对点积 \mathbf{x}_i\mathbf{x}_k 的结果感兴趣,正如我们所知,它是一个实数。只要结果正确,我们并不在乎这个数字是如何获得的。通过使用核技巧,我们可以摆脱将原始特征向量变换为高维向量的昂贵变换,并避免计算它们的点积。我们用对原始特征向量的简单运算来代替它,得到相同的结果。例如,与其将 (q_1, p_1) 变换为 (q_1^2, \sqrt{2} q_1 p_1, p_1^2)、将 (q_2, p_2) 变换为 (q_2^2, \sqrt{2} q_2 p_2, p_2^2),然后计算二者的点积得到 (q_1^2 q_2^2 + 2 q_1 q_2 p_1 p_2 + p_1^2 p_2^2),我们可以先求 (q_1, p_1) 与 (q_2, p_2) 的点积得到 (q_1 q_2 + p_1 p_2),然后将其平方,得到完全相同的结果 (q_1^2 q_2^2 + 2 q_1 q_2 p_1 p_2 + p_1^2 p_2^2)。

On the other hand, we are only interested in the result of the dot-product 𝐱i𝐱k\mathbf{x}_i\mathbf{x}_k, which, as we know, is a real number. We don’t care how this number was obtained as long as it’s correct. By using the kernel trick, we can get rid of a costly transformation of original feature vectors into higher-dimensional vectors and avoid computing their dot-product. We replace that by a simple operation on the original feature vectors that gives the same result. For example, instead of transforming (q1,p1)(q_1, p_1) into (q12,2q1p1,p12)(q_1^2, \sqrt{2} q_1 p_1, p_1^2) and (q2,p2)(q_2, p_2) into (q22,2q2p2,p22)(q_2^2, \sqrt{2} q_2 p_2, p_2^2) and then computing the dot-product of (q12,2q1p1,p12)(q_1^2, \sqrt{2} q_1 p_1, p_1^2) and (q22,2q2p2,p22)(q_2^2, \sqrt{2} q_2 p_2, p_2^2) to obtain (q12q22+2q1q2p1p2+p12p22)(q_1^2q_2^2 + 2 q_1 q_2 p_1 p_2 + p_1^2 p_2^2) we could find the dot-product between (q1,p1)(q_1, p_1) and (q2,p2)(q_2, p_2) to get (q1q2+p1p2)(q_1 q_2 + p_1 p_2) and then square it to get exactly the same result (q12q22+2q1q2p1p2+p12p22)(q_1^2q_2^2 + 2 q_1 q_2 p_1 p_2 + p_1^2 p_2^2).
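The equivalence described above is easy to check numerically. The sketch below (function names are illustrative) compares the dot-product computed in the transformed 3D space with the quadratic kernel applied directly in the original 2D space:

```python
import math

def phi(q, p):
    """Explicit mapping to 3D used in the text: (q^2, sqrt(2)*q*p, p^2)."""
    return (q * q, math.sqrt(2.0) * q * p, p * p)

def dot(u, v):
    """Dot-product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x1, x2):
    """k(x, x') = (x . x')^2 — the same value, without leaving 2D."""
    return dot(x1, x2) ** 2
```

For any pair of 2D points, `dot(phi(q1, p1), phi(q2, p2))` and `quadratic_kernel((q1, p1), (q2, p2))` agree, which is the whole point of the trick: the 3D computation is never performed.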

这是核技巧的一个例子,我们使用了二次核 k(\mathbf{x}_i, \mathbf{x}_k) \stackrel{\text{def}}{=} (\mathbf{x}_i \mathbf{x}_k)^2。存在多种核函数,其中最广泛使用的是 RBF 核:

That was an example of the kernel trick, and we used the quadratic kernel k(𝐱i,𝐱k)=def(𝐱i𝐱k)2k(\mathbf{x}_i, \mathbf{x}_k) \stackrel{\text{def}}{=} (\mathbf{x}_i \mathbf{x}_k)^2. Multiple kernel functions exist, the most widely used of which is the RBF kernel:

k(\mathbf{x},\mathbf{x'}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{x'}\|^2}{2\sigma^2}\right), 其中 \|\mathbf{x}-\mathbf{x'}\|^2 是两个特征向量之间的欧氏距离平方。欧几里得距离由以下等式给出:

k(𝐱,𝐱)=exp(𝐱𝐱22σ2), k(\mathbf{x},\mathbf{x'})= \exp \left(-{\frac {\|\mathbf{x}-\mathbf{x'}\|^2}{2 \sigma^2}}\right), where 𝐱𝐱2\|\mathbf {x} -\mathbf {x'} \|^2 is the squared Euclidean distance between two feature vectors. The Euclidean distance is given by the following equation:


d(𝐱i,𝐱k)=def(xi(1)xk(1))2+(xi(2)xk(2))2++(xi(N)xk(N))2=j=1D(xi(j)xk(j))2. \begin{aligned}d(\mathbf{x}_i,\mathbf{x}_k) &\stackrel{\text{def}}{=} \sqrt{\begin{split}\left(x_i^{(1)} - x_k^{(1)}\right)^2 \\ + \left(x_i^{(2)} - x_k^{(2)}\right)^2 + \cdots \\ + \left(x_i^{(N)}-x_k^{(N)}\right)^2\end{split}} \\ &= \sqrt{\sum_{j=1}^{D}\left(x_i^{(j)}-x_k^{(j)}\right)^2}.\end{aligned}

可以证明,RBF(“径向基函数”)核的特征空间具有无限维数。通过改变超参数σ\sigma,数据分析师可以选择在原始空间中获得平滑或弯曲的决策边界。

It can be shown that the feature space of the RBF (for “radial basis function”) kernel has an infinite number of dimensions. By varying the hyperparameter σ\sigma, the data analyst can choose between getting a smooth or curvy decision boundary in the original space.
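As a sketch, the RBF kernel can be computed directly from its definition above (the function name and default sigma are illustrative):

```python
import math

def rbf_kernel(x1, x2, sigma=1.0):
    """RBF kernel: exp(-||x - x'||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

The kernel equals 1 for identical inputs and decays toward 0 as the points move apart; a smaller sigma makes the decay faster, which in the original space corresponds to a curvier decision boundary.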

3.5 k-最近邻

3.5 k-Nearest Neighbors

k 最近邻(kNN)是一种非参数学习算法。与其他允许在模型构建后丢弃训练数据的学习算法相反,kNN 将所有训练示例保留在内存中。当一个新的、前所未见的示例 \mathbf{x} 进来后,kNN 算法找到 k 个最接近 \mathbf{x} 的训练示例,如果是分类,则返回多数标签;如果是回归,则返回平均标签。

k-Nearest Neighbors (kNN) is a non-parametric learning algorithm. Contrary to other learning algorithms that allow discarding the training data after the model is built, kNN keeps all training examples in memory. Once a new, previously unseen example 𝐱\mathbf{x} comes in, the kNN algorithm finds kk training examples closest to 𝐱\mathbf{x} and returns the majority label, in case of classification, or the average label, in case of regression.

两个例子的接近度由距离函数给出。例如,上面看到的欧几里德距离在实践中经常使用。距离函数的另一个流行选择是负余弦相似度。余弦相似度定义为,

The closeness of two examples is given by a distance function. For example, Euclidean distance seen above is frequently used in practice. Another popular choice of the distance function is the negative cosine similarity. Cosine similarity defined as,


s(𝐱i,𝐱k)=defcos((𝐱i,𝐱k))=j=1Dxi(j)xk(j)j=1D(xi(j))2j=1D(xk(j))2, \begin{aligned}s(\mathbf{x}_i,\mathbf{x}_k) &\stackrel{\text{def}}{=} \cos(\angle(\mathbf{x}_i,\mathbf{x}_k)) \\ &= \frac{\sum_{j=1}^{D}{x_i^{(j)} x_k^{(j)}}}{\sqrt{\sum_{j=1}^D \left(x_i^{(j)}\right)^2} \sqrt{\sum_{j=1}^D \left(x_k^{(j)}\right)^2}},\end{aligned}

是两个向量方向相似度的度量。如果两个向量之间的夹角是 0 度,则两个向量指向同一方向,余弦相似度等于 1。如果向量正交,则余弦相似度为 0。对于指向相反方向的向量,余弦相似度为 -1。如果我们想使用余弦相似度作为距离度量,我们需要将其乘以 -1。其他流行的距离度量包括切比雪夫距离、马氏距离和汉明距离。距离度量的选择以及 k 的值,是分析师在运行算法之前做出的选择。所以这些是超参数。距离度量也可以从数据中学习(而不是猜测)。我们将在第 10 章中讨论这一点。

is a measure of similarity of the directions of two vectors. If the angle between two vectors is 0 degrees, the two vectors point in the same direction, and the cosine similarity is equal to 1. If the vectors are orthogonal, the cosine similarity is 0. For vectors pointing in opposite directions, the cosine similarity is -1. If we want to use cosine similarity as a distance metric, we need to multiply it by -1. Other popular distance metrics include Chebyshev distance, Mahalanobis distance, and Hamming distance. The choice of the distance metric, as well as the value for k, are choices the analyst makes before running the algorithm. So these are hyperparameters. The distance metric could also be learned from data (as opposed to guessing it). We talk about that in Chapter 10.
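A minimal kNN classifier following this description might look like the sketch below, using Euclidean distance and a majority vote (function names and the data layout are illustrative, not from the book):

```python
import math
from collections import Counter

def knn_classify(train, x_new, k=3):
    """train: list of (vector, label) pairs.
    Returns the majority label among the k training examples
    closest to x_new under Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    # sort the whole training set by distance to x_new and keep the k nearest
    neighbors = sorted(train, key=lambda ex: dist(ex[0], x_new))[:k]
    # majority vote over the neighbors' labels
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

Regression would replace the majority vote with the average of the k neighbors’ labels; a real implementation would also use a spatial index instead of sorting the entire training set for each query.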


  1. 要说明 y_i 是实值,我们写 y_i \in \mathbb{R},其中 \mathbb{R} 表示所有实数的集合,即从负无穷大到正无穷大的无限数集合。

  1. To say that yiy_i is real-valued, we write yiy_i \in \mathbb{R}, where \mathbb{R} denotes the set of all real numbers, an infinite set of numbers from minus infinity to plus infinity.

  2. 为了找到函数的最小值或最大值,我们将梯度设置为零,因为函数极值处的梯度值始终为零。在二维中,极值处的梯度是一条水平线。

  2. To find the minimum or the maximum of a function, we set the gradient to zero because the value of the gradient at extrema of a function is always zero. In 2D, the gradient at an extremum is a horizontal line.

  3. 在第 5 章中,我将在超参数调整部分展示如何做到这一点。

  3. In Chapter 5, I show how to do that in the section on hyperparameter tuning.

4学习算法剖析

4 Anatomy of a Learning Algorithm

4.1学习算法的构建模块

4.1 Building Blocks of a Learning Algorithm

通过阅读前一章,您可能已经注意到,我们看到的每个学习算法都由三个部分组成:

You may have noticed by reading the previous chapter that each learning algorithm we saw consisted of three parts:

  1. 损失函数;
  1. a loss function;
  2. 基于损失函数的优化标准(例如成本函数);和
  2. an optimization criterion based on the loss function (a cost function, for example); and
  3. 利用训练数据来寻找优化标准的解决方案的优化例程。
  3. an optimization routine leveraging training data to find a solution to the optimization criterion.

这些是任何学习算法的构建块。您在上一章中看到,一些算法被设计为显式优化特定标准(线性回归和逻辑回归,SVM)。其他一些,包括决策树学习和 kNN,隐式优化标准。决策树学习和 kNN 是最古老的机器学习算法之一,是基于直觉通过实验发明的,没有考虑特定的全局优化标准,并且(就像科学史上经常发生的那样)优化标准是后来开发的,以解释为什么这些算法工作。

These are the building blocks of any learning algorithm. You saw in the previous chapter that some algorithms were designed to explicitly optimize a specific criterion (both linear and logistic regressions, and SVM). Some others, including decision tree learning and kNN, optimize the criterion implicitly. Decision tree learning and kNN are among the oldest machine learning algorithms and were invented experimentally based on intuition, without a specific global optimization criterion in mind; as has often happened in the history of science, the optimization criteria were developed later to explain why those algorithms work.

通过阅读有关机器学习的现代文献,您经常会遇到梯度下降随机梯度下降的参考。这是两种最常用的优化算法,用于优化标准可微分的情况。

By reading modern literature on machine learning, you often encounter references to gradient descent or stochastic gradient descent. These are the two most frequently used optimization algorithms in cases where the optimization criterion is differentiable.

梯度下降是一种用于寻找函数最小值的迭代优化算法。为了使用梯度下降找到函数的局部最小值,我们从某个随机点开始,并采取与当前点处函数的梯度(或近似梯度)的负值成比例的步骤。

Gradient descent is an iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one starts at some random point and takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

梯度下降可用于寻找线性回归、逻辑回归、SVM 以及我们稍后考虑的神经网络的最佳参数。对于许多模型,例如逻辑回归或 SVM,优化标准是凸的。凸函数只有一个最小值,即全局的。神经网络的优化标准不是凸的,但在实践中甚至找到局部最小值就足够了。

Gradient descent can be used to find optimal parameters for linear and logistic regression, SVM and also neural networks which we consider later. For many models, such as logistic regression or SVM, the optimization criterion is convex. Convex functions have only one minimum, which is global. Optimization criteria for neural networks are not convex, but in practice even finding a local minimum suffices.

让我们看看梯度下降是如何工作的。

Let’s see how gradient descent works.

4.2梯度下降

4.2 Gradient Descent

在本节中,我将演示梯度下降如何找到线性回归问题1的解决方案。我用 Python 代码以及图表来说明,这些图表显示了在梯度下降的若干次迭代之后解决方案如何改进。我使用的数据集只有一个特征。然而,优化标准有两个参数:w 和 b。向多维训练数据的扩展很简单:对于二维数据,你有变量 w^{(1)}、w^{(2)} 和 b;对于三维数据,有 w^{(1)}、w^{(2)}、w^{(3)} 和 b,依此类推。

In this section, I demonstrate how gradient descent finds the solution to a linear regression problem1. I illustrate my description with Python code as well as with plots that show how the solution improves after some iterations of gradient descent. I use a dataset with only one feature. However, the optimization criterion will have two parameters: ww and bb. The extension to multi-dimensional training data is straightforward: you have variables w(1)w^{(1)}, w(2)w^{(2)}, and bb for two-dimensional data, w(1)w^{(1)}, w(2)w^{(2)}, w(3)w^{(3)}, and bb for three-dimensional data and so on.

图 15:原始数据。 Y 轴对应于单位销售额(我们想要预测的数量),X 轴对应于我们的特征:广播广告的支出(以百万美元为单位)。

为了举一个实际的例子,我使用了真实的数据集(可以在本书的 wiki 上找到),其中包含以下列:各公司每年在广播广告上的支出,以及以销售量计的年销售额。我们希望建立一个回归模型,用于根据公司在广播广告上的支出来预测销量。数据集中的每一行代表一个特定的公司:

To give a practical example, I use the real dataset (can be found on the book’s wiki) with the following columns: the Spendings of various companies on radio advertising each year and their annual Sales in terms of units sold. We want to build a regression model that we can use to predict units sold based on how much a company spends on radio advertising. Each row in the dataset represents one specific company:

公司 支出,百万元 销量,单位
1 37.8 22.1
2 39.3 10.4
3 45.9 9.3
4 41.3 18.5
.. .. ..

我们有 200 家公司的数据,因此我们有 200 个如下形式的训练示例:(x_i, y_i) = (Spendings_i, Sales_i)。在图 15 中,您可以在二维图上查看所有示例。

We have data for 200 companies, so we have 200 training examples in the form (xi,yi)=(Spendingsi,Salesi)(x_i, y_i) = (Spendings_i, Sales_i). In fig. 15, you can see all examples on a 2D plot.

请记住,线性回归模型如下所示:f(x) = wx + b。我们不知道 w 和 b 的最佳值是多少,我们想从数据中学习它们。为此,我们寻找使均方误差最小化的 w 和 b 的值:

Remember that the linear regression model looks like this: f(x)=wx+bf(x) = wx + b. We don’t know what the optimal values for ww and bb are and we want to learn them from data. To do that, we look for such values for ww and bb that minimize the mean squared error:


l=def1Ni=1N(yi(wxi+b))2. l \stackrel{\text{def}}{=} \frac{1}{N}\sum_{i=1}^{N}(y_i - (wx_i + b))^2.

梯度下降从计算每个参数的偏导数开始:

Gradient descent starts with calculating the partial derivative for every parameter:


lw=1Ni=1N2xi(yi(wxi+b));lb=1Ni=1N2(yi(wxi+b)).\begin{equation}\label{partial-derivatives} \begin{split} \frac{\partial l}{\partial w} &= \frac{1}{N} \sum_{i=1}^N -2x_i(y_i - (wx_i + b)); \\ \frac{\partial l}{\partial b} &= \frac{1}{N} \sum_{i=1}^N -2(y_i - (wx_i + b)). \end{split} \end{equation}

为了求项 (y_i - (wx + b))^2 关于 w 的偏导数,我应用了链式法则。在这里,我们有链 f = f_2(f_1),其中 f_1 = y_i - (wx + b),f_2 = f_1^2。要求 f 关于 w 的偏导数,我们必须首先求 f 关于 f_2 的偏导数,它等于 2(y_i - (wx + b))(根据微积分,我们知道导数 \frac{\partial}{\partial x}x^2 = 2x),然后我们必须将其乘以 y_i - (wx + b) 关于 w 的偏导数,它等于 -x。所以总体上 \frac{\partial l}{\partial w} = \frac{1}{N} \sum_{i=1}^N -2x_i(y_i - (wx_i + b))。类似地,可以计算 l 关于 b 的偏导数 \frac{\partial l}{\partial b}。

To find the partial derivative of the term (yi(wx+b))2(y_i - (wx + b))^2 with respect to ww I applied the chain rule. Here, we have the chain f=f2(f1)f = f_2(f_1) where f1=yi(wx+b)f_1 = y_i - (wx + b) and f2=f12f_2 = f_1^2. To find a partial derivative of ff with respect to ww we have to first find the partial derivative of ff with respect to f2f_2 which is equal to 2(yi(wx+b))2(y_i - (wx + b)) (from calculus, we know that the derivative xx2=2x\frac{\partial}{\partial x}x^2 = 2x) and then we have to multiply it by the partial derivative of yi(wx+b)y_i - (wx + b) with respect to ww which is equal to x-x. So overall lw=1Ni=1N2xi(yi(wxi+b))\frac{\partial l}{\partial w} = \frac{1}{N} \sum_{i=1}^N -2x_i(y_i - (wx_i + b)). In a similar way, the partial derivative of ll with respect to bb, lb\frac{\partial l}{\partial b}, was calculated.

梯度下降以 epoch(时期)为单位进行。一个时期包括完整地使用一次训练集来更新每个参数。一开始,即第一个时期,我们初始化2 w \gets 0 和 b \gets 0。此时,上式给出的偏导数 \frac{\partial l}{\partial w} 和 \frac{\partial l}{\partial b} 分别等于 \frac{-2}{N}\sum_{i=1}^N x_i y_i 和 \frac{-2}{N}\sum_{i=1}^N y_i。在每个时期,我们都会使用偏导数更新 w 和 b。学习率 \alpha 控制更新的大小:

Gradient descent proceeds in epochs. An epoch consists of using the training set entirely to update each parameter. In the beginning, the first epoch, we initialize2 w \gets 0 and b \gets 0. The partial derivatives \frac{\partial l}{\partial w} and \frac{\partial l}{\partial b} given by the equations above then equal, respectively, \frac{-2}{N}\sum_{i=1}^N x_iy_i and \frac{-2}{N} \sum_{i=1}^N y_i. At each epoch, we update w and b using the partial derivatives. The learning rate \alpha controls the size of an update:


wwαlw;bbαlb.\begin{equation} \begin{split} w &\gets w - \alpha\frac{\partial l}{\partial w}; \\ b & \gets b - \alpha\frac{\partial l}{\partial b}. \end{split} \end{equation}

我们从参数值中减去(而不是添加)偏导数,因为导数是函数增长的指标。如果导数在某个点3为正,则函数在该点增长。因为我们想要最小化目标函数,所以当导数为正时,我们知道我们需要向相反方向移动参数(向坐标轴的左侧)。当导数为负(函数递减)时,我们需要将参数向右移动以进一步减小函数的值。从参数中减去负值会将其移至右侧。

We subtract (as opposed to adding) partial derivatives from the values of parameters because derivatives are indicators of growth of a function. If a derivative is positive at some point3, then the function grows at this point. Because we want to minimize the objective function, when the derivative is positive we know that we need to move our parameter in the opposite direction (to the left on the axis of coordinates). When the derivative is negative (function is decreasing), we need to move our parameter to the right to decrease the value of the function even more. Subtracting a negative value from a parameter moves it to the right.

在下一个时期,我们使用更新后的 w 和 b 值,根据上面的等式重新计算偏导数;我们继续这个过程直到收敛。通常,我们需要很多时期,直到我们开始看到 w 和 b 的值在每个时期之后都不再发生太大变化;然后我们停止。

At the next epoch, we recalculate the partial derivatives using the equations above with the updated values of ww and bb; we continue the process until convergence. Typically, we need many epochs until we start seeing that the values for ww and bb don’t change much after each epoch; then we stop.

很难想象一个机器学习工程师不使用Python。因此,如果您正在等待开始学习 Python 的合适时机,那么现在就是时候了。下面,我展示了如何在 Python 中编写梯度下降程序。

It’s hard to imagine a machine learning engineer who doesn’t use Python. So, if you waited for the right moment to start learning Python, this is that moment. Below, I show how to program gradient descent in Python.

在一个 epoch 期间更新参数 $w$ 和 $b$ 的函数如下所示:

The function that updates the parameters ww and bb during one epoch is shown below:
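The listing itself is missing from this copy. What follows is a minimal sketch consistent with the surrounding description; the function and variable names are assumptions, not the author's original listing:

```python
def update_w_and_b(x, y, w, b, alpha):
    """Perform one epoch of gradient descent on (w, b).

    x, y are lists of training inputs and targets; alpha is the learning rate.
    """
    dl_dw = 0.0  # accumulator for the partial derivative of the loss w.r.t. w
    dl_db = 0.0  # accumulator for the partial derivative of the loss w.r.t. b
    N = len(x)
    for i in range(N):
        # derivatives of the squared error (y_i - (w*x_i + b))**2
        dl_dw += -2 * x[i] * (y[i] - (w * x[i] + b))
        dl_db += -2 * (y[i] - (w * x[i] + b))
    # move each parameter in the direction opposite to its gradient
    w = w - (1 / float(N)) * dl_dw * alpha
    b = b - (1 / float(N)) * dl_db * alpha
    return w, b
```

The derivatives accumulated here are those of the mean squared error $\frac{1}{N}\sum_{i=1}^N (y_i - (wx_i + b))^2$ with respect to $w$ and $b$.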

循环多个纪元的函数如下所示:

The function that loops over multiple epochs is shown below:
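This listing is also absent from this copy. A sketch consistent with the description follows; the two helpers are restated compactly so the block runs on its own, and all names are assumptions:

```python
def update_w_and_b(x, y, w, b, alpha):
    # one epoch of gradient descent over the whole training set
    N = len(x)
    dl_dw = sum(-2 * x[i] * (y[i] - (w * x[i] + b)) for i in range(N))
    dl_db = sum(-2 * (y[i] - (w * x[i] + b)) for i in range(N))
    return w - alpha * dl_dw / N, b - alpha * dl_db / N

def avg_loss(x, y, w, b):
    # mean squared error of the current model on the training set
    return sum((y[i] - (w * x[i] + b)) ** 2 for i in range(len(x))) / len(x)

def train(x, y, w, b, alpha, epochs):
    """Run gradient descent for a number of epochs, reporting progress."""
    for e in range(epochs):
        w, b = update_w_and_b(x, y, w, b, alpha)
        if e % 400 == 0:  # log the average loss once in a while
            print("epoch: ", e, "loss:", avg_loss(x, y, w, b))
    return w, b
```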

图 16:回归线在梯度下降时期的演变。
Figure 16: The evolution of the regression line through epochs of gradient descent.

上面代码片段中的函数avg_loss是计算均方误差的函数。它定义为:

The function avg_loss in the above code snippet is a function that computes the mean squared error. It is defined as:
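The definition is missing here; a sketch matching the description (the mean squared error of the model $f(x) = wx + b$ over the dataset) is:

```python
def avg_loss(x, y, w, b):
    """Mean squared error of the linear model f(x) = w*x + b over the data."""
    N = len(x)
    total = 0.0
    for i in range(N):
        total += (y[i] - (w * x[i] + b)) ** 2
    return total / N
```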

如果我们以 $\alpha=0.001$、$w = 0.0$、$b = 0.0$ 运行训练函数 15,000 个 epoch,我们将看到以下输出(部分显示):

If we run the train function for α=0.001\alpha=0.001, w=0.0w = 0.0, b=0.0b = 0.0, and 15,000 epochs, we will see the following output (shown partially):

epoch:  0 loss: 92.32078294903626
epoch:  400 loss: 33.79131790081576
epoch:  800 loss: 27.9918542960729
epoch:  1200 loss: 24.33481690722147
epoch:  1600 loss: 22.028754937538633
...
epoch:  2800 loss: 19.07940244306619

您可以看到,随着训练函数循环历元,平均损失会减少。在图 16 中,您可以看到回归线在各个时期的演变。

You can see that the average loss decreases as the train function loops through epochs. In fig. 16 you can see the evolution of the regression line through epochs.

最后,一旦我们找到了参数 $w$ 和 $b$ 的最佳值,唯一缺少的部分是进行预测的函数:

Finally, once we have found the optimal values of parameters ww and bb, the only missing piece is a function that makes predictions:
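The prediction function is missing from this copy; for a one-dimensional linear model it presumably looks like this one-liner (the name is an assumption):

```python
def predict(x, w, b):
    """Apply the learned linear model to a new input x."""
    return w * x + b
```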

尝试执行以下代码:

Try to execute the following code:

输出是 13.97。

The output is 13.97.

梯度下降对学习率 $\alpha$ 的选择很敏感。对于大型数据集来说它也很慢。幸运的是,已经对该算法提出了一些重大改进。

Gradient descent is sensitive to the choice of the learning rate α\alpha. It is also slow for large datasets. Fortunately, several significant improvements to this algorithm have been proposed.

小批量随机梯度下降(小批量 SGD)是该算法的一个版本,它通过使用训练数据的较小批量(子集)来近似梯度,从而加速计算。SGD 本身有各种“升级”。Adagrad 是 SGD 的一个版本,它根据梯度的历史为每个参数缩放 $\alpha$。因此,对于梯度非常大的参数,$\alpha$ 会减小,反之亦然。动量是一种通过将梯度下降定向到相关方向并减少振荡来帮助加速 SGD 的方法。在神经网络训练中,经常使用 SGD 的变体,例如 RMSprop 和 Adam。

Minibatch stochastic gradient descent (minibatch SGD) is a version of the algorithm that speeds up the computation by approximating the gradient using smaller batches (subsets) of the training data. SGD itself has various “upgrades”. Adagrad is a version of SGD that scales $\alpha$ for each parameter according to the history of gradients. As a result, $\alpha$ is reduced for very large gradients and vice versa. Momentum is a method that helps accelerate SGD by orienting the gradient descent in the relevant direction and reducing oscillations. In neural network training, variants of SGD, such as RMSprop and Adam, are very frequently used.

请注意,梯度下降及其变体不是机器学习算法。它们是最小化问题的求解器,其中要最小化的函数具有梯度(在其域的大多数点)。

Notice that gradient descent and its variants are not machine learning algorithms. They are solvers of minimization problems in which the function to minimize has a gradient (in most points of its domain).

4.3机器学习工程师如何工作

4.3 How Machine Learning Engineers Work

除非您是研究科学家或在拥有大量研发预算的大公司工作,否则您通常不会自己实现机器学习算法。您也没有实现梯度下降或其他求解器。您使用库,其中大部分都是开源的。库是算法和支持工具的集合,其实现时考虑到了稳定性和效率。实践中最常用的开源机器学习库是scikit-learn。它是用 Python 和 C 编写的。以下是在 scikit-learn 中进行线性回归的方法:

Unless you are a research scientist or work for a huge corporation with a large R&D budget, you usually don’t implement machine learning algorithms yourself. You don’t implement gradient descent or some other solver either. You use libraries, most of which are open source. A library is a collection of algorithms and supporting tools implemented with stability and efficiency in mind. The most frequently used in practice open-source machine learning library is scikit-learn. It’s written in Python and C. Here’s how you do linear regression in scikit-learn:
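The scikit-learn snippet is absent from this copy; it presumably looks like the sketch below. The data here is a placeholder, since the chapter's actual dataset isn't shown (note that scikit-learn expects the inputs as a 2D array):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# placeholder training data standing in for the chapter's dataset
x = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

model = LinearRegression().fit(x, y)  # training is a single call

x_new = np.array([[23.0]])            # a new input to predict
y_new = model.predict(x_new)
print(y_new)
```

With the chapter's actual data, this call prints the 13.97 mentioned in the text.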

输出将再次是 13.97。容易,对吧?您可以将 LinearRegression 替换为其他类型的回归学习算法,而无需修改任何其他内容。它就是有效的。关于分类也是如此。您可以轻松地将LogisticRegression算法替换为SVC算法(这是 scikit-learn 对支持向量机算法的名称)、DecisionTreeClassifier 、 NearestNeighbors或 scikit-learn 中实现的许多其他分类学习算法。

The output will, again, be 13.9713.97. Easy, right? You can replace LinearRegression with some other type of regression learning algorithm without modifying anything else. It just works. The same can be said about classification. You can easily replace LogisticRegression algorithm with SVC algorithm (this is scikit-learn’s name for the Support Vector Machine algorithm), DecisionTreeClassifier, NearestNeighbors or many other classification learning algorithms implemented in scikit-learn.

4.4学习算法的特殊性

4.4 Learning Algorithms’ Particularities

在这里,我概述了一些可以区分不同学习算法的实际特性。您已经知道,不同的学习算法可以有不同的超参数(SVM 中的 $C$,ID3 中的 $\epsilon$ 和 $d$)。诸如梯度下降之类的求解器也可以有超参数,例如 $\alpha$。

Here, I outline some practical particularities that can differentiate one learning algorithm from another. You already know that different learning algorithms can have different hyperparameters ($C$ in SVM, $\epsilon$ and $d$ in ID3). Solvers such as gradient descent can also have hyperparameters, for example $\alpha$.

某些算法(例如决策树学习)可以接受分类特征。例如,如果您有一个特征“颜色”,可以采用“红色”、“黄色”或“绿色”值,则可以保持此特征不变。 SVM、逻辑回归和线性回归以及 kNN(具有余弦相似度或欧几里得距离度量)期望所有特征都有数值。 scikit-learn 中实现的所有算法都期望数字特征。在下一章中,我将展示如何将分类特征转换为数值特征。

Some algorithms, like decision tree learning, can accept categorical features. For example, if you have a feature “color” that can take values “red”, “yellow”, or “green”, you can keep this feature as is. SVM, logistic and linear regression, as well as kNN (with cosine similarity or Euclidean distance metrics), expect numerical values for all features. All algorithms implemented in scikit-learn expect numerical features. In the next chapter, I show how to convert categorical features into numerical ones.

某些算法(例如 SVM)允许数据分析师为每个类别提供权重。这些权重影响决策边界的绘制方式。如果某个类别的权重很高,学习算法会尝试在预测该类别的训练示例时不犯错误(通常,以在其他地方犯错误为代价)。如果某些类的实例在训练数据中占少数,那么这一点可能很重要,但您希望尽可能避免对该类的示例进行错误分类。

Some algorithms, like SVM, allow the data analyst to provide weightings for each class. These weightings influence how the decision boundary is drawn. If the weight of some class is high, the learning algorithm tries to not make errors in predicting training examples of this class (typically, for the cost of making an error elsewhere). That could be important if instances of some class are in the minority in your training data, but you would like to avoid misclassifying examples of that class as much as possible.

一些分类模型,如 SVM 和 kNN,在给定特征向量时仅输出类别。其他模型,如逻辑回归或决策树,还可以返回 0 到 1 之间的分数,它可以解释为模型对预测的置信度,或输入示例属于某个类别的概率4。

Some classification models, like SVM and kNN, given a feature vector, only output the class. Others, like logistic regression or decision trees, can also return a score between $0$ and $1$, which can be interpreted as either how confident the model is about the prediction or as the probability that the input example belongs to a certain class4.

一些分类算法(例如决策树学习、逻辑回归或 SVM)会立即使用整个数据集构建模型。如果您有额外的标记示例,则必须从头开始重建模型。其他算法(例如 scikit-learn 中的朴素贝叶斯、多层感知器、SGDClassifier/SGDRegressor、PassiveAggressiveClassifier/PassiveAggressiveRegressor)可以迭代训练,一次一批。一旦有新的训练示例可用,您就可以仅使用新数据更新模型。

Some classification algorithms (like decision tree learning, logistic regression, or SVM) build the model using the whole dataset at once. If you have got additional labeled examples, you have to rebuild the model from scratch. Other algorithms (such as Naïve Bayes, multilayer perceptron, SGDClassifier/SGDRegressor, PassiveAggressiveClassifier/PassiveAggressiveRegressor in scikit-learn) can be trained iteratively, one batch at a time. Once new training examples are available, you can update the model using only the new data.

最后,一些算法,如决策树学习、SVM 和 kNN,可以同时用于分类和回归,而其他算法只能解决一类问题:分类或回归,但不能同时解决两者。

Finally, some algorithms, like decision tree learning, SVM, and kNN can be used for both classification and regression, while others can only solve one type of problem: either classification or regression, but not both.

通常,每个库都会提供文档来解释每个算法解决什么类型的问题、允许哪些输入值以及模型产生什么类型的输出。该文档还提供了有关超参数的信息。

Usually, each library provides the documentation that explains what kind of problem each algorithm solves, what input values are allowed and what kind of output the model produces. The documentation also provides information on hyperparameters.


  1. 如您所知,线性回归有一个封闭式解。这意味着解决这种特定类型的问题不需要梯度下降。然而,出于说明目的,线性回归是解释梯度下降的完美问题。

  1. As you know, linear regression has a closed form solution. That means that gradient descent is not needed to solve this specific type of problem. However, for illustration purposes, linear regression is a perfect problem to explain gradient descent.

  2. 在复杂模型中,例如具有数千个参数的神经网络,参数的初始化可能会显著影响使用梯度下降找到的解。有不同的初始化方法(随机、全零、零附近的小值等等),这是数据分析师必须做出的重要选择。

  2. In complex models, such as neural networks, which have thousands of parameters, the initialization of parameters may significantly affect the solution found using gradient descent. There are different initialization methods (at random, with all zeroes, with small values around zero, and others) and it is an important choice the data analyst has to make.

  3. 点由参数的当前值给出。

  3. A point is given by the current values of parameters.

  4. 如果确实有必要,可以使用简单的技术综合创建 SVM 和 kNN 预测的分数。

  4. If it’s really necessary, the score for SVM and kNN predictions could be synthetically created using simple techniques.

5基本练习

5 Basic Practice

到目前为止,我只是顺便提到了数据分析师在处理机器学习问题时需要考虑的一些问题:特征工程、过拟合和超参数调整。在本章中,我们将讨论这些挑战,以及在您可以在 scikit-learn 中键入 model = LogisticRegression().fit(x,y) 之前必须解决的其他挑战。

Until now, I only mentioned in passing some issues that a data analyst needs to consider when working on a machine learning problem: feature engineering, overfitting, and hyperparameter tuning. In this chapter, we talk about these and other challenges that have to be addressed before you can type model = LogisticRegression().fit(x,y) in scikit-learn.

5.1特征工程

5.1 Feature Engineering

当产品经理告诉你“我们需要能够预测特定客户是否会留在我们身边。这是五年来客户与我们产品互动的日志。”您不能仅仅获取数据,将其加载到库中并获得预测。您需要首先构建一个数据集

When a product manager tells you “We need to be able to predict whether a particular customer will stay with us. Here are the logs of customers’ interactions with our product for five years.” you cannot just grab the data, load it into a library and get a prediction. You need to build a dataset first.

请记住第一章中的内容:数据集是标记示例的集合 $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$。$N$ 个元素中的每个 $\mathbf{x}_i$ 称为特征向量。特征向量是一个向量,其中每个维度 $j=1,\ldots,D$ 都包含一个以某种方式描述示例的值。该值称为特征,表示为 $x^{(j)}$。

Remember from the first chapter that the dataset is the collection of labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$. Each element $\mathbf{x}_i$ among $N$ is called a feature vector. A feature vector is a vector in which each dimension $j=1,\ldots,D$ contains a value that describes the example somehow. That value is called a feature and is denoted as $x^{(j)}$.

将原始数据转换为数据集的问题称为特征工程。对于大多数实际问题,特征工程是一个劳动密集型过程,需要数据分析师大量的创造力,最好是领域知识。

The problem of transforming raw data into a dataset is called feature engineering. For most practical problems, feature engineering is a labor-intensive process that demands from the data analyst a lot of creativity and, preferably, domain knowledge.

例如,为了转换用户与计算机系统交互的日志,可以创建包含有关用户的信息以及从日志中提取的各种统计信息的特征。对于每个用户,一项功能将包含订阅价格;其他功能将包含每天、每周和每年的连接频率。另一项功能将包含平均会话持续时间(以秒为单位)或一个请求的平均响应时间等。一切可测量的东西都可以用作特征。数据分析师的作用是创建信息丰富的特征:这些特征将使学习算法能够构建一个模型,该模型可以很好地预测用于训练的数据的标签。高信息量特征也称为具有高预测能力的特征。例如,用户会话的平均持续时间对于预测用户将来是否会继续使用该应用程序的问题具有很高的预测能力。

For example, to transform the logs of user interaction with a computer system, one could create features that contain information about the user and various statistics extracted from the logs. For each user, one feature would contain the price of the subscription; other features would contain the frequency of connections per day, week and year. Another feature would contain the average session duration in seconds or the average response time for one request, and so on. Everything measurable can be used as a feature. The role of the data analyst is to create informative features: those would allow the learning algorithm to build a model that does a good job of predicting labels of the data used for training. Highly informative features are also called features with high predictive power. For example, the average duration of a user’s session has high predictive power for the problem of predicting whether the user will keep using the application in the future.

当模型能够很好地预测训练数据时,我们说模型具有较低的偏差。也就是说,当我们使用模型来预测用于构建模型的示例的标签时,该模型几乎不会犯错误。

We say that a model has a low bias when it predicts the training data well. That is, the model makes few mistakes when we use it to predict labels of the examples used to build the model.

5.1.1 One-Hot 编码

5.1.1 One-Hot Encoding

一些学习算法仅适用于数值特征向量。当数据集中的某些特征是分类特征时,例如“颜色”或“一周中的几天”,您可以将此类分类特征转换为多个二进制特征。

Some learning algorithms only work with numerical feature vectors. When some feature in your dataset is categorical, like “colors” or “days of the week,” you can transform such a categorical feature into several binary ones.

如果您的示例具有分类特征“颜色”,并且该特征具有三个可能的值:“红色”、“黄色”、“绿色”,则可以将此特征转换为由三个数值组成的向量:

If your example has a categorical feature “colors” and this feature has three possible values: “red,” “yellow,” “green,” you can transform this feature into a vector of three numerical values:

\begin{equation} \begin{split} \text{red} &= [1,0,0] \\ \text{yellow} &= [0,1,0] \\ \text{green} &= [0,0,1] \end{split} \end{equation}

\begin{equation} \begin{split} \text{red} &= [1,0,0] \\ \text{yellow} &= [0,1,0] \\ \text{green} &= [0,0,1] \end{split} \end{equation}

通过这样做,您增加了特征向量的维度。您不应该为了避免增加维度而把红色变成 1、黄色变成 2、绿色变成 3,因为这意味着该类别中的值之间存在顺序,并且这种特定顺序对决策很重要。如果特征值的顺序不重要,则使用有序数字作为值可能会混淆学习算法1,因为算法会尝试寻找并不存在的规律性,这可能会导致过度拟合。

By doing so, you increase the dimensionality of your feature vectors. You should not transform red into 1, yellow into 2, and green into 3 to avoid increasing the dimensionality, because that would imply that there’s an order among the values in this category and that this specific order is important for the decision making. If the order of a feature’s values is not important, using ordered numbers as values is likely to confuse the learning algorithm,1 because the algorithm will try to find a regularity where there is none, which may potentially lead to overfitting.
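The mapping above takes only a few lines of Python (a sketch; the helper name is an assumption — in practice one would typically use a library encoder such as scikit-learn's OneHotEncoder):

```python
def one_hot(value, categories):
    """Encode a categorical value as a binary vector over the known categories."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec

colors = ["red", "yellow", "green"]
print(one_hot("yellow", colors))  # -> [0, 1, 0]
```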

5.1.2分箱

5.1.2 Binning

另一种相反的情况在实践中较少出现:您有一个数值特征,但想将其转换为分类特征。分箱(也称为分桶)是将连续特征转换为多个二元特征(通常基于取值范围)的过程,这些二元特征称为箱或桶。例如,分析师可以将年龄范围划分为离散的箱,而不是将年龄表示为单个实值特征:0 到 5 岁之间的所有年龄可以放入一个箱,6 到 10 岁放入第二个箱,11 到 15 岁放入第三个箱,依此类推。

An opposite situation, occurring less frequently in practice, is when you have a numerical feature but you want to convert it into a categorical one. Binning (also called bucketing) is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value range. For example, instead of representing age as a single real-valued feature, the analyst could chop ranges of age into discrete bins: all ages between 0 and 5 years old could be put into one bin, 6 to 10 years old could be in the second bin, 11 to 15 years old could be in the third bin, and so on.

例如,让特征 $j=4$ 代表年龄。通过应用分箱,我们用相应的箱替换此特征。设添加三个新箱 “age_bin1”、“age_bin2” 和 “age_bin3”,索引分别为 $j=123$、$j=124$ 和 $j=125$。现在,如果对于某个示例 $\mathbf{x}_i$ 有 $x_i^{(4)} = 7$,则我们将特征 $x_i^{(124)}$ 设置为 1;如果 $x_i^{(4)} = 13$,则我们将特征 $x_i^{(125)}$ 设置为 1,依此类推。

For example, let feature $j=4$ represent age. By applying binning, we replace this feature with the corresponding bins. Let the three new bins, “age_bin1”, “age_bin2” and “age_bin3” be added with indexes $j=123$, $j=124$ and $j=125$ respectively. Now if $x_i^{(4)} = 7$ for some example $\mathbf{x}_i$, then we set feature $x_i^{(124)}$ to $1$; if $x_i^{(4)} = 13$, then we set feature $x_i^{(125)}$ to $1$, and so on.

在某些情况下,精心设计的分箱可以帮助学习算法使用更少的示例进行学习。发生这种情况是因为我们向学习算法给出了一个“提示”:如果某个特征的值落在特定范围内,则该特征的确切值并不重要。

In some cases, a carefully designed binning can help the learning algorithm to learn using fewer examples. It happens because we give a “hint” to the learning algorithm that if the value of a feature falls within a specific range, the exact value of the feature doesn’t matter.
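As an illustration, the age bins above can be produced with a small helper (a sketch; the function name and the choice of half-open ranges are assumptions):

```python
def bin_feature(value, edges):
    """One-hot bin membership: bin i covers edges[i] <= value < edges[i+1]."""
    bins = [0] * (len(edges) - 1)
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            bins[i] = 1
            break
    return bins

# edges chosen so that ages 0-5, 6-10 and 11-15 each map to one bin
edges = [0, 6, 11, 16]
print(bin_feature(7, edges))  # -> [0, 1, 0]: age 7 falls in the second bin
```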

5.1.3标准化

5.1.3 Normalization

归一化是将数值特征可以采用的实际值范围转换为标准值范围(通常是区间 $[-1,1]$ 或 $[0,1]$)的过程。

Normalization is the process of converting an actual range of values which a numerical feature can take, into a standard range of values, typically in the interval $[-1,1]$ or $[0,1]$.

例如,假设特定特征的自然范围是 350 到 1450。通过从该特征的每个值中减去 350,并将结果除以 1100,可以将这些值归一化到范围 $[0,1]$ 内。

For example, suppose the natural range of a particular feature is 350 to 1450. By subtracting 350 from every value of the feature, and dividing the result by 1100, one can normalize those values into the range $[0,1]$.

更一般地,归一化公式如下所示:$\bar{x}^{(j)} = \frac{x^{(j)} - min^{(j)}}{max^{(j)} - min^{(j)}}$,其中 $min^{(j)}$ 和 $max^{(j)}$ 分别是数据集中特征 $j$ 的最小值和最大值。

More generally, the normalization formula looks like this: $\bar{x}^{(j)} = \frac{x^{(j)} - min^{(j)}}{max^{(j)} - min^{(j)}}$, where $min^{(j)}$ and $max^{(j)}$ are, respectively, the minimum and the maximum value of the feature $j$ in the dataset.

我们为什么要归一化?归一化数据并不是严格的要求。然而,在实践中,它可以提高学习速度。回想上一章中的梯度下降示例。假设您有一个二维特征向量。当您更新参数 $w^{(1)}$ 和 $w^{(2)}$ 时,您使用均方误差关于 $w^{(1)}$ 和 $w^{(2)}$ 的偏导数。如果 $x^{(1)}$ 在范围 $[0,1000]$ 内,而 $x^{(2)}$ 在范围 $[0,0.0001]$ 内,那么关于较大特征的导数将主导更新。

Why do we normalize? Normalizing the data is not a strict requirement. However, in practice, it can lead to an increased speed of learning. Remember the gradient descent example from the previous chapter. Imagine you have a two-dimensional feature vector. When you update the parameters $w^{(1)}$ and $w^{(2)}$, you use partial derivatives of the mean squared error with respect to $w^{(1)}$ and $w^{(2)}$. If $x^{(1)}$ is in the range $[0,1000]$ and $x^{(2)}$ in the range $[0,0.0001]$, then the derivative with respect to the larger feature will dominate the update.

此外,确保我们的输入大致在相同的相对较小的范围内是有用的,以避免计算机在处理非常小或非常大的数字时出现问题(称为数字溢出)。

Additionally, it’s useful to ensure that our inputs are roughly in the same relatively small range to avoid problems which computers have when working with very small or very big numbers (known as numerical overflow).

5.1.4标准化

5.1.4 Standardization

标准化(或 z 分数标准化)是重新调整特征值的过程,使它们具有 $\mu = 0$ 且 $\sigma=1$ 的标准正态分布的属性,其中 $\mu$ 是均值(特征在数据集中所有示例上的平均值),$\sigma$ 是标准差。

Standardization (or z-score normalization) is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution with $\mu = 0$ and $\sigma = 1$, where $\mu$ is the mean (the average value of the feature, averaged over all examples in the dataset) and $\sigma$ is the standard deviation from the mean.

特征的标准分数(或 z 分数)计算如下:

Standard scores (or z-scores) of features are calculated as follows:

$\hat{x}^{(j)} = \frac{x^{(j)} - \mu^{(j)}}{\sigma^{(j)}}.$

$\hat{x}^{(j)} = \frac{x^{(j)} - \mu^{(j)}}{\sigma^{(j)}}.$
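In code, the same rescaling looks like this (a sketch using the population standard deviation; the function name is an assumption):

```python
def standardize(values):
    """Rescale feature values to z-scores: zero mean, unit standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

z = standardize([1.0, 2.0, 3.0, 4.0])
print(z)  # values symmetric around a mean of zero
```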

您可能会问什么时候应该使用标准化,什么时候应该使用标准化。这个问题没有明确的答案。通常,如果您的数据集不太大并且您有时间,您可以尝试两者,看看哪一个对您的任务表现更好。

You may ask when you should use normalization and when standardization. There’s no definitive answer to this question. Usually, if your dataset is not too big and you have time, you can try both and see which one performs better for your task.

如果您没有时间进行多个实验,根据经验:

If you don’t have time to run multiple experiments, as a rule of thumb:

  • 在实践中,无监督学习算法通常更受益于标准化而不是归一化;
  • unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization;
  • 如果某个特征所取的值分布接近正态分布(所谓的钟形曲线),那么标准化也是首选;
  • standardization is also preferred for a feature if the values this feature takes are distributed close to a normal distribution (so-called bell curve);
  • 同样,如果某个特征有时具有极高或极低的值(异常值),那么标准化是首选;这是因为归一化会将正常值“压缩”到一个非常小的范围内;
  • again, standardization is preferred for a feature if it can sometimes have extremely high or low values (outliers); this is because normalization will “squeeze” the normal values into a very small range;
  • 在所有其他情况下,标准化是更好的选择。
  • in all other cases, normalization is preferable.

特征重新缩放通常对大多数学习算法有益。然而,您可以在流行的库中找到学习算法的现代实现,它们对于不同范围内的特征具有鲁棒性。

Feature rescaling is usually beneficial to most learning algorithms. However, modern implementations of the learning algorithms, which you can find in popular libraries, are robust to features lying in different ranges.

5.1.5处理缺失的特征

5.1.5 Dealing with Missing Features

在某些情况下,数据以具有已定义特征的数据集的形式提供给分析师。在一些示例中,一些特征的值可能缺失。当数据集是手工制作的,并且处理数据集的人员忘记填写某些值或根本没有对它们进行测量时,经常会发生这种情况。

In some cases, the data comes to the analyst in the form of a dataset with features already defined. In some examples, values of some features can be missing. That often happens when the dataset was handcrafted, and the person working on it forgot to fill some values or didn’t get them measured at all.

处理特征缺失值的典型方法包括:

The typical approaches of dealing with missing values for a feature include:

  • 从数据集中删除缺少特征的示例(如果您的数据集足够大,可以牺牲一些训练示例,则可以这样做);
  • removing the examples with missing features from the dataset (that can be done if your dataset is big enough so you can sacrifice some training examples);
  • 使用可以处理缺失特征值的学习算法(取决于库和算法的具体实现);
  • using a learning algorithm that can deal with missing feature values (depends on the library and a specific implementation of the algorithm);
  • 使用数据插补技术。
  • using a data imputation technique.

5.1.6数据插补技术

5.1.6 Data Imputation Techniques

一种数据插补技术是用该特征在数据集中的平均值替换缺失值:$\hat{x}^{(j)} \gets \frac{1}{N}\sum_{i=1}^{N}x_i^{(j)}$。

One data imputation technique consists in replacing the missing value of a feature by the average value of this feature in the dataset: $\hat{x}^{(j)} \gets \frac{1}{N}\sum_{i=1}^{N}x_i^{(j)}$.

另一种技术是用正常值范围之外的值替换缺失值。例如,如果正常范围是 $[0,1]$,那么您可以将缺失值设置为 2 或 -1。这个想法是,学习算法将学习当特征的值与常规值显著不同时最好怎么做。或者,您可以用范围中间的值替换缺失值。例如,如果某个特征的范围是 $[-1,1]$,您可以将缺失值设置为 0。这里的想法是,范围中间的值不会显著影响预测。

Another technique is to replace the missing value with a value outside the normal range of values. For example, if the normal range is $[0,1]$, then you can set the missing value to 2 or -1. The idea is that the learning algorithm will learn what is best to do when the feature has a value significantly different from regular values. Alternatively, you can replace the missing value by a value in the middle of the range. For example, if the range for a feature is $[-1,1]$, you can set the missing value to be equal to 0. Here, the idea is that the value in the middle of the range will not significantly affect the prediction.

更先进的技术是将缺失值用作回归问题的目标变量。您可以使用所有剩余特征 $[x^{(1)}_i, x^{(2)}_i, \ldots, x^{(j-1)}_i, x^{(j+1)}_i, \ldots, x^{(D)}_i]$ 形成特征向量 $\hat{\mathbf{x}}_i$,并设 $\hat{y}_i \gets x^{(j)}_i$,其中 $j$ 是具有缺失值的特征。然后您构建一个回归模型,从 $\hat{\mathbf{x}}$ 预测 $\hat{y}$。当然,要构建训练示例 $(\hat{\mathbf{x}}, \hat{y})$,您只使用原始数据集中特征 $j$ 的值存在的那些示例。

A more advanced technique is to use the missing value as the target variable for a regression problem. You can use all remaining features $[x^{(1)}_i, x^{(2)}_i, \ldots, x^{(j-1)}_i, x^{(j+1)}_i, \ldots, x^{(D)}_i]$ to form a feature vector $\hat{\mathbf{x}}_i$, set $\hat{y}_i \gets x^{(j)}_i$, where $j$ is the feature with a missing value. Then you build a regression model to predict $\hat{y}$ from $\hat{\mathbf{x}}$. Of course, to build training examples $(\hat{\mathbf{x}},\hat{y})$, you only use those examples from the original dataset, in which the value of feature $j$ is present.

最后,如果您有一个非常大的数据集,并且只有少数几个特征有缺失值,您可以为每个有缺失值的特征添加一个二进制指示特征,从而增加特征向量的维数。假设您的 $D$ 维数据集中的特征 $j = 12$ 有缺失值。对于每个特征向量 $\mathbf{x}$,您添加特征 $j = D+1$:如果 $\mathbf{x}$ 中存在特征 12 的值,则它等于 1,否则等于 0。缺失的特征值可以替换为 0 或您选择的任意数字。

Finally, if you have a significantly large dataset and just a few features with missing values, you can increase the dimensionality of your feature vectors by adding a binary indicator feature for each feature with missing values. Let’s say feature $j = 12$ in your $D$-dimensional dataset has missing values. For each feature vector $\mathbf{x}$, you then add the feature $j = D+1$ which is equal to $1$ if the value of feature 12 is present in $\mathbf{x}$ and $0$ otherwise. The missing feature value then can be replaced by $0$ or any number of your choice.
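The mean-imputation and indicator-feature techniques can be sketched in a few lines (function names are assumptions; None stands in for a missing value):

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def missing_indicator(column):
    """Binary indicator feature: 1 if the value is present, 0 if it is missing."""
    return [0 if v is None else 1 for v in column]

col = [1.0, None, 3.0]
print(impute_mean(col))        # -> [1.0, 2.0, 3.0]
print(missing_indicator(col))  # -> [1, 0, 1]
```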

在预测时,如果您的示例不完整,您应该使用与完成训练数据所用的技术相同的数据插补技术来填充缺失的特征。

At prediction time, if your example is not complete, you should use the same data imputation technique to fill the missing features as the technique you used to complete the training data.

在开始解决学习问题之前,您无法判断哪种数据插补技术最有效。尝试多种技术,构建多种模型,然后选择最有效的一种。

Before you start working on the learning problem, you cannot tell which data imputation technique will work the best. Try several techniques, build several models and select the one that works the best.

5.2学习算法选择

5.2 Learning Algorithm Selection

选择机器学习算法可能是一项艰巨的任务。如果你时间充裕,可以全部都尝试一下。然而,通常解决问题的时间是有限的。在开始解决问题之前,您可以问自己几个问题。根据您的答案,您可以列出一些算法并在您的数据上进行尝试。

Choosing a machine learning algorithm can be a difficult task. If you have much time, you can try all of them. However, usually the time you have to solve a problem is limited. You can ask yourself several questions before starting to work on the problem. Depending on your answers, you can shortlist some algorithms and try them on your data.

  • 可解释性
  • Explainability

您的模型是否必须能够向非技术受众解释?大多数非常准确的学习算法都是所谓的“黑匣子”。他们学习的模型很少出错,但模型为何做出特定预测可能很难理解,甚至更难解释。此类模型的示例是神经网络或集成模型。

Does your model have to be explainable to a non-technical audience? Most very accurate learning algorithms are so-called “black boxes.” They learn models that make very few errors, but why a model made a specific prediction could be very hard to understand and even harder to explain. Examples of such models are neural networks or ensemble models.

另一方面,kNN、线性回归或决策树学习算法生成的模型并不总是最准确的,但是它们的预测方式非常简单。

On the other hand, kNN, linear regression, or decision tree learning algorithms produce models that are not always the most accurate, however, the way they make their prediction is very straightforward.

  • 内存中与内存外
  • In-memory vs. out-of-memory

您的数据集可以完全加载到服务器或个人计算机的 RAM 中吗?如果是,那么您可以从多种算法中进行选择。否则,您会更喜欢增量学习算法,它可以通过逐渐添加更多数据来改进模型。

Can your dataset be fully loaded into the RAM of your server or personal computer? If yes, then you can choose from a wide variety of algorithms. Otherwise, you would prefer incremental learning algorithms that can improve the model by adding more data gradually.

  • 特征和示例的数量
  • Number of features and examples

您的数据集中有多少个训练示例?每个示例有多少个特征?一些算法,包括神经网络梯度提升(我们稍后会考虑),可以处理大量的示例和数百万个特征。其他的,比如 SVM,其能力可能非常有限。

How many training examples do you have in your dataset? How many features does each example have? Some algorithms, including neural networks and gradient boosting (we consider both later), can handle a huge number of examples and millions of features. Others, like SVM, can be very modest in their capacity.

  • 分类特征与数值特征
  • Categorical vs. numerical features

您的数据是仅由分类特征组成,还是仅由数值特征组成,还是两者的混合?根据您的答案,某些算法无法直接处理您的数据集,您需要将分类特征转换为数值特征。

Is your data composed of categorical only, or numerical only features, or a mix of both? Depending on your answer, some algorithms cannot handle your dataset directly, and you would need to convert your categorical features into numerical ones.

  • 数据的非线性
  • Nonlinearity of the data

您的数据是线性可分离的还是可以使用线性模型进行建模?如果是,带有线性核、逻辑回归或线性回归的 SVM 可能是不错的选择。否则,第 6 章和第 7 章中讨论的深度神经网络或集成算法可能会效果更好。

Is your data linearly separable or can it be modeled using a linear model? If yes, SVM with the linear kernel, logistic or linear regression can be good choices. Otherwise, deep neural networks or ensemble algorithms, discussed in Chapters 6 and 7, might work better.

  • 训练速度
  • Training speed

学习算法允​​许使用多少时间来构建模型?众所周知,神经网络的训练速度很慢。逻辑回归、线性回归或决策树等简单算法要快得多。专门的库包含一些算法的非常有效的实现;您可能更喜欢在线研究以查找此类库。一些算法(例如随机森林)受益于多个 CPU 内核的可用性,因此在具有数十个内核的计算机上可以显着减少其模型构建时间。

How much time is a learning algorithm allowed to use to build a model? Neural networks are known to be slow to train. Simple algorithms like logistic and linear regression or decision trees are much faster. Specialized libraries contain very efficient implementations of some algorithms; you may prefer to do research online to find such libraries. Some algorithms, such as random forests, benefit from the availability of multiple CPU cores, so their model building time can be significantly reduced on a machine with dozens of cores.

  • 预测速度
  • Prediction speed

模型生成预测时的速度必须有多快?您的模型是否会用于需要非常高吞吐量的生产中?支持向量机、线性回归和逻辑回归以及(某些类型的)神经网络等算法在预测时速度非常快。其他算法,如 kNN、集成算法以及非常深或循环的神经网络,则速度较慢2

How fast does the model have to be when generating predictions? Will your model be used in production where very high throughput is required? Algorithms like SVMs, linear and logistic regression, and (some types of) neural networks, are extremely fast at the prediction time. Others, like kNN, ensemble algorithms, and very deep or recurrent neural networks, are slower2.

如果您不想猜测最适合您的数据的算法,选择算法的一种流行方法是在验证集上进行测试。我们在下面讨论这一点。或者,如果您使用 scikit-learn,您可以尝试图 17 所示的算法选择图。

If you don’t want to guess the best algorithm for your data, a popular way to choose one is by testing it on the validation set. We talk about that below. Alternatively, if you use scikit-learn, you could try their algorithm selection diagram shown in fig. 17.

5.3三组

5.3 Three Sets

到目前为止,我交替使用“数据集”和“训练集”这两个表达方式。然而,在实践中,数据分析师使用三组不同的标记示例:

Until now, I used the expressions “dataset” and “training set” interchangeably. However, in practice data analysts work with three distinct sets of labeled examples:

  1. 训练集,
  1. training set,
  2. 验证集,以及
  2. validation set, and
  3. 测试集。
  3. test set.

获得带注释的数据集后,您要做的第一件事就是打乱示例并将数据集分为三个子集:训练、验证和测试。训练集通常是最大的;你用它来构建模型。验证集和测试集的大小大致相同,比训练集的大小小得多。学习算法无法使用这两个子集中的示例来构建模型。这就是为什么这两个集合通常被称为保留集合

Once you have got your annotated dataset, the first thing you do is you shuffle the examples and split the dataset into three subsets: training, validation, and test. The training set is usually the biggest one; you use it to build the model. The validation and test sets are roughly the same sizes, much smaller than the size of the training set. The learning algorithm cannot use examples from these two subsets to build the model. That is why those two sets are often called holdout sets.

将数据集划分为这三个子集没有最佳比例。过去,经验法则是使用 70% 的数据集进行训练,15% 用于验证,15% 用于测试。然而,在大数据时代,数据集通常有数百万个示例。在这种情况下,保留 95% 用于训练和 2.5%/2.5% 用于验证/测试可能是合理的。

There’s no optimal proportion to split the dataset into these three subsets. In the past, the rule of thumb was to use 70% of the dataset for training, 15% for validation and 15% for testing. However, in the age of big data, datasets often have millions of examples. In such cases, it could be reasonable to keep 95% for training and 2.5%/2.5% for validation/testing.

您可能会想,为什么要三套而不是一套呢?答案很简单:当我们构建模型时,我们不希望模型只擅长预测学习算法已经见过的示例标签。简单地记住所有训练样本,然后使用内存来“预测”它们的标签的简单算法在被要求预测训练样本的标签时不会出错,但这样的算法在实践中是无用的。我们真正想要的是一个善于预测学习算法没有看到的例子的模型:我们希望在保留集上有良好的性能。

You may wonder, what is the reason to have three sets and not one. The answer is simple: when we build a model, what we do not want is for the model to only do well at predicting labels of examples the learning algorithm has already seen. A trivial algorithm that simply memorizes all training examples and then uses the memory to “predict” their labels will make no mistakes when asked to predict the labels of the training examples, but such an algorithm would be useless in practice. What we really want is a model that is good at predicting examples that the learning algorithm didn’t see: we want good performance on a holdout set.

图 17:scikit-learn 的机器学习算法选择图。
Figure 17: The machine learning algorithm selection diagram of scikit-learn.

为什么我们需要两组而不是一组?我们使用验证集来 1)选择学习算法和 2)找到超参数的最佳值。在将模型交付给客户或投入生产之前,我们使用测试集来评估模型。

Why do we need two holdout sets and not one? We use the validation set to 1) choose the learning algorithm and 2) find the best values of hyperparameters. We use the test set to assess the model before delivering it to the client or putting it in production.

5.4欠拟合和过拟合

5.4 Underfitting and Overfitting

我上面提到了偏差的概念。我说过,如果模型能够很好地预测训练数据的标签,那么它的偏差就较低。如果模型在训练数据上犯了很多错误,我们就说模型有高偏差或者模型欠拟合。因此,欠拟合是指模型无法很好地预测其训练数据的标签。欠拟合可能有多种原因,其中最重要的是:

I mentioned above the notion of bias. I said that a model has a low bias if it predicts well the labels of the training data. If the model makes many mistakes on the training data, we say that the model has a high bias or that the model underfits. So, underfitting is the inability of the model to predict well the labels of the data it was trained on. There could be several reasons for underfitting, the most important of which are:

  • 您的模型对于数据来说太简单(例如线性模型通常可能欠拟合);
  • your model is too simple for the data (for example a linear model can often underfit);
  • 您设计的功能信息不够丰富。
  • the features you engineered are not informative enough.
图 18:欠拟合示例(线性模型)。
图 19:良好拟合的示例(二次模型)。
图 20:过度拟合的示例(15 次多项式)。

第一个原因很容易在一维回归的情况下说明:数据集可以类似于曲线,但我们的模型是直线。第二个原因可以这样说明:假设你想预测一个病人是否患有癌症,你拥有的特征是身高、血压和心率。这三个特征显然不是癌症的良好预测因子,因此我们的模型将无法学习这些特征和标签之间有意义的关系。

The first reason is easy to illustrate in the case of one-dimensional regression: the dataset can resemble a curved line, but our model is a straight line. The second reason can be illustrated like this: let’s say you want to predict whether a patient has cancer, and the features you have are height, blood pressure, and heart rate. These three features are clearly not good predictors for cancer so our model will not be able to learn a meaningful relationship between these features and the label.

欠拟合问题的解决方案是尝试更复杂的模型或设计具有更高预测能力的特征。

The solution to the problem of underfitting is to try a more complex model or to engineer features with higher predictive power.

过度拟合是模型可能出现的另一个问题。过拟合的模型对训练数据的预测效果很好,但对来自两个保留集之一的数据的预测效果很差。我已经在第 3 章中对过拟合进行了说明。导致过拟合的原因有几个,其中最重要的是:

Overfitting is another problem a model can exhibit. The model that overfits predicts very well the training data but poorly the data from at least one of the two holdout sets. I already gave an illustration of overfitting in Chapter 3. Several reasons can lead to overfitting, the most important of which are:

  • 您的模型对于数据来说太复杂(例如非常高的决策树或非常深或宽的神经网络通常会过度拟合);
  • your model is too complex for the data (for example a very tall decision tree or a very deep or wide neural network often overfit);
  • 你的特征太多,但训练示例却很少。
  • you have too many features but a small number of training examples.

在文献中,你可以找到过拟合问题的另一个名称:高方差问题。这个术语来自统计。方差是模型的误差,因为它对训练集中的小波动很敏感。这意味着如果您的训练数据采样方式不同,学习将产生显着不同的模型。这就是为什么过度拟合的模型在测试数据上表现不佳的原因:测试数据和训练数据是从数据集中独立采样的。

In the literature, you can find another name for the problem of overfitting: the problem of high variance. This term comes from statistics. The variance is an error of the model due to its sensitivity to small fluctuations in the training set. It means that if your training data were sampled differently, learning would result in a significantly different model. That is why a model that overfits performs poorly on the test data: the test and training data are sampled from the dataset independently of one another.

即使是最简单的模型(例如线性模型)也可能过拟合数据。这通常发生在数据是高维的、而训练样本数量相对较少的情况下。事实上,当特征向量的维度非常高时,线性学习算法可以构建一个为参数向量 \mathbf{w} 中的大多数参数 w^{(j)} 分配非零值的模型,试图在所有可用特征之间找到非常复杂的关系,以完美预测训练示例的标签。

Even the simplest model, such as linear, can overfit the data. That usually happens when the data is high-dimensional, but the number of training examples is relatively low. In fact, when feature vectors are very high-dimensional, the linear learning algorithm can build a model that assigns non-zero values to most parameters w^{(j)} in the parameter vector \mathbf{w}, trying to find very complex relationships between all available features to predict labels of training examples perfectly.

如此复杂的模型很可能对保留示例的标签预测得很差。这是因为,在试图完美预测所有训练样本的标签的过程中,模型还会学到训练集的特质:训练样本特征值中的噪声、由于数据集较小而导致的采样不完美,以及其他与手头决策问题无关、但存在于训练集中的伪影。

Such a complex model will most likely predict poorly the labels of the holdout examples. This is because by trying to perfectly predict labels of all training examples, the model will also learn the idiosyncrasies of the training set: the noise in the values of features of the training examples, the sampling imperfection due to the small dataset size, and other artifacts extrinsic to the decision problem at hand but present in the training set.

图 18 至图 20 展示了一个一维数据集:回归模型分别对该数据欠拟合、拟合良好和过拟合。

Plots in figs. 18-20 illustrate a one-dimensional dataset for which a regression model underfits, fits well, and overfits the data.

过度拟合问题的解决方案有多种:

Several solutions to the problem of overfitting are possible:

  1. 尝试更简单的模型(用线性回归代替多项式回归,或使用线性核而不是 RBF 核的 SVM,或层数/单元数更少的神经网络)。
  1. Try a simpler model (linear instead of polynomial regression, or SVM with a linear kernel instead of RBF, a neural network with fewer layers/units).
  2. 降低数据集中示例的维度(例如,通过使用第 9 章中讨论的降维技术之一)。
  2. Reduce the dimensionality of examples in the dataset (for example, by using one of the dimensionality reduction techniques discussed in Chapter 9).
  3. 如果可能的话,添加更多训练数据。
  3. Add more training data, if possible.
  4. 正则化模型。
  4. Regularize the model.

正则化是最广泛使用的防止过度拟合的方法。

Regularization is the most widely used approach to prevent overfitting.

5.5正则化

5.5 Regularization

正则化是一个总括术语,它包含迫使学习算法构建不太复杂的模型的方法。在实践中,这通常会导致稍高的偏差,但会显着降低方差。这个问题在文献中被称为偏差-方差权衡

Regularization is an umbrella term that encompasses methods that force the learning algorithm to build a less complex model. In practice, that often leads to slightly higher bias but significantly reduces the variance. This problem is known in the literature as the bias-variance tradeoff.

两种最广泛使用的正则化类型称为L1L2 正则化。这个想法很简单。为了创建正则化模型,我们通过添加惩罚项来修改目标函数,当模型更复杂时,惩罚项的值更高。

The two most widely used types of regularization are called L1 and L2 regularization. The idea is quite simple. To create a regularized model, we modify the objective function by adding a penalizing term whose value is higher when the model is more complex.

为简单起见,我使用线性回归的示例来说明正则化。相同的原理可以应用于多种模型。

For simplicity, I illustrate regularization using the example of linear regression. The same principle can be applied to a wide variety of models.

回想一下线性回归目标:\min_{\mathbf{w}, b} \frac{1}{N} \sum_{i=1}^N (f_{\mathbf{w}, b}(\mathbf{x}_i) - y_i)^2. \qquad(16)

Recall the linear regression objective: \min_{\mathbf{w}, b} \frac{1}{N} \sum_{i=1}^N (f_{\mathbf{w}, b}(\mathbf{x}_i) - y_i)^2. \qquad(16)

L1 正则化目标如下所示:\begin{split}\min_{\mathbf{w}, b}\Bigg[ C\lvert\mathbf{w}\rvert + \frac{1}{N} \sum_{i=1}^N (f_{\mathbf{w}, b}(\mathbf{x}_i) - y_i)^2\Bigg],\end{split} \qquad(17)

An L1-regularized objective looks like this: \begin{split}\min_{\mathbf{w}, b}\Bigg[ C\lvert\mathbf{w}\rvert + \frac{1}{N} \sum_{i=1}^N (f_{\mathbf{w}, b}(\mathbf{x}_i) - y_i)^2\Bigg],\end{split} \qquad(17)

其中 |\mathbf{w}| \stackrel{\text{def}}{=} \sum_{j=1}^D |w^{(j)}|,C 是控制正则化重要性的超参数。如果我们将 C 设为零,该模型就成为标准的非正则化线性回归模型。另一方面,如果我们将 C 设为一个很大的值,学习算法会尝试将大多数 w^{(j)} 设为非常小的值或零以最小化目标函数,模型会变得非常简单,这可能导致欠拟合。作为数据分析师,您的角色是找到这样一个超参数 C 的值:它不会过多地增加偏差,但能将方差降低到当前问题可接受的水平。在下一节中,我将展示如何做到这一点。

where |\mathbf{w}| \stackrel{\text{def}}{=} \sum_{j=1}^D |w^{(j)}| and C is a hyperparameter that controls the importance of regularization. If we set C to zero, the model becomes a standard non-regularized linear regression model. On the other hand, if we set C to a high value, the learning algorithm will try to set most w^{(j)} to a very small value or zero to minimize the objective, and the model will become very simple, which can lead to underfitting. Your role as the data analyst is to find such a value of the hyperparameter C that doesn’t increase the bias too much but reduces the variance to a level reasonable for the problem at hand. In the next section, I will show how to do that.

L2 正则化目标如下所示:\begin{split}\min_{\mathbf{w}, b}\Bigg[ C\lVert\mathbf{w}\rVert^2 + \frac{1}{N} \sum_{i=1}^N (f_{\mathbf{w}, b}(\mathbf{x}_i) - y_i)^2\Bigg],\end{split} \qquad(18)

An L2-regularized objective looks like this: \begin{split}\min_{\mathbf{w}, b}\Bigg[ C\lVert\mathbf{w}\rVert^2 + \frac{1}{N} \sum_{i=1}^N (f_{\mathbf{w}, b}(\mathbf{x}_i) - y_i)^2\Bigg],\end{split} \qquad(18)

其中 \|\mathbf{w}\|^2 \stackrel{\text{def}}{=} \sum_{j=1}^D (w^{(j)})^2。

where \|\mathbf{w}\|^2 \stackrel{\text{def}}{=} \sum_{j=1}^D (w^{(j)})^2.

实际上,L1 正则化会生成一个稀疏模型:只要超参数 C 足够大,模型的大部分参数(对于线性模型,即大部分 w^{(j)})都等于零。因此,L1 通过决定哪些特征对预测至关重要、哪些不是,来执行特征选择。如果您想提高模型的可解释性,这会很有用。但是,如果您的唯一目标是最大限度地提高模型在保留数据上的性能,那么 L2 通常会给出更好的结果。L2 还具有可微分的优点,因此可以使用梯度下降来优化目标函数。

In practice, L1 regularization produces a sparse model, a model that has most of its parameters (in the case of linear models, most of the w^{(j)}) equal to zero, provided the hyperparameter C is large enough. So L1 performs feature selection by deciding which features are essential for prediction and which are not. That can be useful in case you want to increase model explainability. However, if your only goal is to maximize the performance of the model on the holdout data, then L2 usually gives better results. L2 also has the advantage of being differentiable, so gradient descent can be used for optimizing the objective function.
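As a rough numerical sketch of eqs. 16-18, the two regularized objectives can be evaluated for a linear model as follows. The pure-Python helper names are hypothetical; real training code would of course minimize these objectives rather than just evaluate them:

```python
def predict(w, b, x):
    # linear model f_{w,b}(x) = w . x + b
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def mse(w, b, X, y):
    # average squared error of eq. 16
    return sum((predict(w, b, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)

def l1_objective(w, b, X, y, C):
    # eq. 17: C * |w| + average squared error
    return C * sum(abs(wj) for wj in w) + mse(w, b, X, y)

def l2_objective(w, b, X, y, C):
    # eq. 18: C * ||w||^2 + average squared error
    return C * sum(wj ** 2 for wj in w) + mse(w, b, X, y)
```

Setting C to zero in either function recovers the plain linear regression objective of eq. 16.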

L1 和 L2 正则化方法也被组合在所谓的弹性网络正则化中,其中 L1 和 L2 正则化是特殊情况。您可以在文献中找到L2 的名称为岭正则化,L1 的名称为套索

L1 and L2 regularization methods were also combined in what is called elastic net regularization with L1 and L2 regularizations being special cases. You can find in the literature the name ridge regularization for L2 and lasso for L1.

除了广泛用于线性模型之外,L1 和 L2 正则化也经常用于神经网络和许多其他类型的模型,它们直接最小化目标函数。

In addition to being widely used with linear models, L1 and L2 regularization are also frequently used with neural networks and many other types of models, which directly minimize an objective function.

神经网络还受益于另外两种正则化技术:dropoutbatch-normalization。还有一些具有正则化效果的非数学方法:数据增强早期停止。我们将在第 8 章中讨论这些技术。

Neural networks also benefit from two other regularization techniques: dropout and batch-normalization. There are also non-mathematical methods that have a regularization effect: data augmentation and early stopping. We talk about these techniques in Chapter 8.

5.6模型性能评估

5.6 Model Performance Assessment

一旦你有了学习算法使用训练集构建的模型,你如何判断这个模型有多好呢?您使用测试集来评估模型。

Once you have a model that the learning algorithm has built using the training set, how can you tell how good the model is? You use the test set to assess it.

测试集包含学习算法以前从未见过的示例,因此如果我们的模型在预测测试集中示例的标签方面表现良好,我们就说我们的模型概括得很好,或者简单地说,它很好。

The test set contains the examples that the learning algorithm has never seen before, so if our model performs well on predicting the labels of the examples from the test set, we say that our model generalizes well or, simply, that it’s good.

为了更加严格,机器学习专家使用各种正式的指标和工具来评估模型的性能。对于回归,模型的评估非常简单。拟合良好的回归模型会产生接近观测数据值的预测值。如果没有信息特征,通常会使用平均模型,它总是预测训练数据中标签的平均值。因此,正在评估的回归模型的拟合应该优于平均模型的拟合。如果是这种情况,那么下一步就是比较模型在训练数据和测试数据上的性能。

To be more rigorous, machine learning specialists use various formal metrics and tools to assess the model performance. For regression, the assessment of the model is quite simple. A well-fitting regression model results in predicted values close to the observed data values. The mean model, which always predicts the average of the labels in the training data, generally would be used if there were no informative features. The fit of a regression model being assessed should, therefore, be better than the fit of the mean model. If this is the case, then the next step is to compare the performances of the model on the training and the test data.

为此,我们分别计算训练数据和测试数据的均方误差3 (MSE)。如果模型在测试数据上的 MSE大大高于在训练数据上获得的 MSE,则这是过度拟合的迹象。正则化或更好的超参数调整可以解决这个问题。 “显着更高”的含义取决于当前的问题,并且必须由数据分析师与订购模型的决策者/产品负责人共同决定。

To do that, we compute the mean squared error3 (MSE) for the training, and, separately, for the test data. If the MSE of the model on the test data is substantially higher than the MSE obtained on the training data, this is a sign of overfitting. Regularization or a better hyperparameter tuning could solve the problem. The meaning of “substantially higher” depends on the problem at hand and has to be decided by the data analyst jointly with the decision maker/product owner who ordered the model.
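The MSE computation itself is a one-liner; comparing its value on the training data with its value on the test data is the overfitting check just described. A minimal sketch (helper name illustrative):

```python
def mean_squared_error(y_true, y_pred):
    # average of squared differences between observed and predicted labels
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# a test MSE substantially higher than the training MSE signals overfitting
train_mse = mean_squared_error([1.0, 2.0, 3.0], [1.1, 1.9, 3.0])
test_mse = mean_squared_error([1.0, 2.0, 3.0], [2.0, 0.5, 4.5])
```

How much higher counts as "substantially higher" remains, as the text says, a judgment call made with the product owner.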

对于分类来说,事情有点复杂。用于评估分类模型的最广泛使用的指标和工具是:

For classification, things are a little bit more complicated. The most widely used metrics and tools to assess the classification model are:

  • 混淆矩阵,
  • confusion matrix,
  • 准确性,
  • accuracy,
  • 对成本敏感的准确性,
  • cost-sensitive accuracy,
  • 精确度/召回率,以及
  • precision/recall, and
  • ROC 曲线下的面积。
  • area under the ROC curve.

为了简化说明,我使用二元分类问题。如有必要,我将展示如何将该方法扩展到多类案例。

To simplify the illustration, I use a binary classification problem. Where necessary, I show how to extend the approach to the multiclass case.

5.6.1混淆矩阵

5.6.1 Confusion Matrix

混淆矩阵是一个表格,总结了分类模型在预测属于不同类别的示例方面的成功程度。混淆矩阵的一个轴是模型预测的标签,另一轴是实际标签。在二元分类问题中,有两个类。假设该模型预测两类:“spam”和“not_spam”:

The confusion matrix is a table that summarizes how successful the classification model is at predicting examples belonging to various classes. One axis of the confusion matrix is the label that the model predicted, and the other axis is the actual label. In a binary classification problem, there are two classes. Let’s say, the model predicts two classes: “spam” and “not_spam”:

                 垃圾邮件(预测)   not_spam(预测)
垃圾邮件(实际)   23(TP)           1(FN)
not_spam(实际)   12(FP)           556(TN)

上述混淆矩阵显示,在 24 个实际上是垃圾邮件的示例中,模型将 23 个正确分类为垃圾邮件。在这种情况下,我们说有 23 个真阳性,即 TP = 23。模型将 1 个示例错误分类为 not_spam;在这种情况下,我们有 1 个假阴性,即 FN = 1。同样,在 568 个实际上不是垃圾邮件的示例中,556 个被正确分类(556 个真阴性,即 TN = 556),12 个被错误分类(12 个误报,即 FP = 12)。

The above confusion matrix shows that of the 24 examples that actually were spam, the model correctly classified 23 as spam. In this case, we say that we have 23 true positives, or TP = 23. The model incorrectly classified 1 example as not_spam. In this case, we have 1 false negative, or FN = 1. Similarly, of the 568 examples that actually were not spam, 556 were correctly classified (556 true negatives, or TN = 556), and 12 were incorrectly classified (12 false positives, or FP = 12).
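The four counts of the matrix can be tallied directly from paired lists of actual and predicted labels. A minimal sketch, assuming string class labels (the function name is illustrative):

```python
def binary_confusion_matrix(actual, predicted, positive="spam"):
    """Count TP, FN, FP, TN for a binary classifier, treating `positive`
    as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # actual positive, predicted positive
        elif a == positive:
            fn += 1          # actual positive, predicted negative
        elif p == positive:
            fp += 1          # actual negative, predicted positive
        else:
            tn += 1          # actual negative, predicted negative
    return tp, fn, fp, tn
```

For multiclass problems, the same idea generalizes to a table with one row and one column per class.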

多类分类的混淆矩阵具有与不同类一样多的行和列。它可以帮助您确定错误模式。例如,混淆矩阵可以揭示,经过训练来识别不同种类动物的模型往往会错误地预测“猫”而不是“豹”,或者错误地预测“老鼠”而不是“大鼠”。在这种情况下,您可以决定添加更多这些物种的标记示例,以帮助学习算法“看到”它们之间的差异。或者,您可以添加学习算法可用于构建模型的其他功能,以更好地区分这些物种。

The confusion matrix for multiclass classification has as many rows and columns as there are different classes. It can help you to determine mistake patterns. For example, a confusion matrix could reveal that a model trained to recognize different species of animals tends to mistakenly predict “cat” instead of “panther,” or “mouse” instead of “rat.” In this case, you can decide to add more labeled examples of these species to help the learning algorithm to “see” the difference between them. Alternatively, you might add additional features the learning algorithm can use to build a model that would better distinguish between these species.

混淆矩阵用于计算另外两个性能指标:精度召回率

The confusion matrix is used to calculate two other performance metrics: precision and recall.

5.6.2精度/召回率

5.6.2 Precision/Recall

评估模型最常用的两个指标是精度召回率。精度是正确的阳性预测与阳性预测总数的比率:

The two most frequently used metrics to assess the model are precision and recall. Precision is the ratio of correct positive predictions to the overall number of positive predictions:

\text{precision} \stackrel{\text{def}}{=} \frac{\text{TP}}{\text{TP} + \text{FP}}.

\text{precision} \stackrel{\text{def}}{=} \frac{\text{TP}}{\text{TP} + \text{FP}}.

召回率是正确的正面预测与数据集中正面示例总数的比率:

Recall is the ratio of correct positive predictions to the overall number of positive examples in the dataset:

\text{recall} \stackrel{\text{def}}{=} \frac{\text{TP}}{\text{TP} + \text{FN}}.

\text{recall} \stackrel{\text{def}}{=} \frac{\text{TP}}{\text{TP} + \text{FN}}.
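Both definitions follow directly from the counts of the confusion matrix above; using the spam example's counts (TP = 23, FP = 12, FN = 1) gives a sketch like this (function names illustrative):

```python
def precision(tp, fp):
    # correct positive predictions / all positive predictions
    return tp / (tp + fp)

def recall(tp, fn):
    # correct positive predictions / all actual positive examples
    return tp / (tp + fn)

# counts from the spam confusion matrix above
spam_precision = precision(23, 12)  # 23 / 35
spam_recall = recall(23, 1)         # 23 / 24
```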

为了理解精确率和召回率对于模型评估的意义和重要性,将预测问题看作使用查询从数据库中检索文档的问题通常很有帮助。精确率是所有返回文档中相关文档所占的比例。召回率是搜索引擎返回的相关文档数与本可以返回的相关文档总数之比。

To understand the meaning and importance of precision and recall for model assessment, it is often useful to think of the prediction problem as the problem of retrieving relevant documents from a database using a query. The precision is the proportion of relevant documents in the list of all returned documents. The recall is the ratio of the relevant documents returned by the search engine to the total number of relevant documents that could have been returned.

在垃圾邮件检测问题的情况下,我们希望具有高精度(我们希望通过检测合法邮件是垃圾邮件来避免犯错误),并且我们准备容忍较低的召回率(我们容忍收件箱中的一些垃圾邮件)。

In the case of the spam detection problem, we want to have high precision (we want to avoid making mistakes by detecting that a legitimate message is spam) and we are ready to tolerate lower recall (we tolerate some spam messages in our inbox).

在实践中,我们几乎总是必须在高精度和高召回率之间做出选择。通常不可能两者兼得。我们可以通过多种方式实现两者中的任何一个:

Almost always, in practice, we have to choose between a high precision or a high recall. It’s usually impossible to have both. We can achieve either of the two by various means:

  • 通过为特定类别的示例分配更高的权重(SVM 算法接受类别权重作为输入);
  • by assigning a higher weighting to the examples of a specific class (the SVM algorithm accepts weightings of classes as input);
  • 通过调整超参数以最大限度地提高验证集的精度或召回率;
  • by tuning hyperparameters to maximize precision or recall on the validation set;
  • 通过改变返回类别概率的算法的决策阈值;例如,如果我们使用逻辑回归或决策树,为了提高精确率(以较低的召回率为代价),我们可以决定只有当模型返回的概率高于 0.9 时,预测才为正类。
  • by varying the decision threshold for algorithms that return probabilities of classes; for instance, if we use logistic regression or a decision tree, to increase precision (at the cost of a lower recall), we can decide that the prediction will be positive only if the probability returned by the model is higher than 0.9.

即使针对二元分类情况定义了精度和召回率,您也始终可以使用它来评估多类分类模型。为此,首先选择您想要评估这些指标的类别。然后,您将所选类别的所有示例视为正例,将其余类别的所有示例视为负例。

Even though precision and recall are defined for the binary classification case, you can always use them to assess a multiclass classification model. To do that, first select a class for which you want to assess these metrics. Then you consider all examples of the selected class as positives and all examples of the remaining classes as negatives.

5.6.3准确度

5.6.3 Accuracy

准确度由正确分类示例的数量除以分类示例的总数得出。就混淆矩阵而言,它由下式给出:

Accuracy is given by the number of correctly classified examples divided by the total number of classified examples. In terms of the confusion matrix, it is given by:

\text{accuracy} \stackrel{\text{def}}{=} \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}. \qquad(19)

\text{accuracy} \stackrel{\text{def}}{=} \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}. \qquad(19)

当预测所有类别的错误同样重要时,准确性是一个有用的指标。如果是垃圾邮件/非垃圾邮件,情况可能并非如此。例如,您对误报的容忍程度要低于对漏报的容忍程度。垃圾邮件检测中的误报是指您的朋友向您发送电子邮件,但模型将其标记为垃圾邮件并且不向您显示的情况。另一方面,误报并不是什么大问题:如果您的模型没有检测到一小部分垃圾邮件,那也没什么大不了的。

Accuracy is a useful metric when errors in predicting all classes are equally important. In case of the spam/not spam, this may not be the case. For example, you would tolerate false positives less than false negatives. A false positive in spam detection is the situation in which your friend sends you an email, but the model labels it as spam and doesn’t show you. On the other hand, the false negative is less of a problem: if your model doesn’t detect a small percentage of spam messages, it’s not a big deal.

图 21:ROC 曲线下方的面积(以灰色显示)。

5.6.4成本敏感的准确度

5.6.4 Cost-Sensitive Accuracy

为了处理不同类别具有不同重要性的情况,一个有用的指标是成本敏感的准确度。要计算成本敏感的准确度,首先为两种类型的错误(FP 和 FN)分别分配一个成本(正数)。然后像往常一样计算 TP、TN、FP、FN 的计数,并在使用式 19 计算准确度之前,将 FP 和 FN 的计数乘以相应的成本。

For dealing with the situation in which different classes have different importance, a useful metric is cost-sensitive accuracy. To compute a cost-sensitive accuracy, you first assign a cost (a positive number) to both types of mistakes: FP and FN. You then compute the counts TP, TN, FP, FN as usual and multiply the counts for FP and FN by the corresponding cost before calculating the accuracy using eq. 19.
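A minimal sketch of that computation (the function name and the default costs of 1 are illustrative; with both costs equal to 1 it reduces to ordinary accuracy, eq. 19):

```python
def cost_sensitive_accuracy(tp, tn, fp, fn, fp_cost=1.0, fn_cost=1.0):
    """Eq. 19 with the FP and FN counts multiplied by their costs first."""
    weighted_fp = fp * fp_cost
    weighted_fn = fn * fn_cost
    return (tp + tn) / (tp + tn + weighted_fp + weighted_fn)
```

For the spam example, raising `fn_cost` above 1 lowers the score whenever the model lets spam through, reflecting that those mistakes matter more to you.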

5.6.5 ROC 曲线下面积 (AUC)

5.6.5 Area under the ROC Curve (AUC)

ROC 曲线(代表“接收器工作特性”;该术语来自雷达工程)是评估分类模型性能的常用方法。 ROC 曲线结合使用真阳性率(准确定义为召回率)和假阳性率(错误预测的负例的比例)来构建分类性能的摘要图。

The ROC curve (stands for “receiver operating characteristic;” the term comes from radar engineering) is a commonly used method to assess the performance of classification models. ROC curves use a combination of the true positive rate (defined exactly as recall) and false positive rate (the proportion of negative examples predicted incorrectly) to build up a summary picture of the classification performance.

真阳性率(TPR)和假阳性率(FPR)分别定义为:

The true positive rate (TPR) and the false positive rate (FPR) are respectively defined as,

\text{TPR} \stackrel{\text{def}}{=} \frac{\text{TP}}{\text{TP} + \text{FN}}

\text{TPR} \stackrel{\text{def}}{=} \frac{\text{TP}}{\text{TP} + \text{FN}}

and

\text{FPR} \stackrel{\text{def}}{=} \frac{\text{FP}}{\text{FP} + \text{TN}}.

\text{FPR} \stackrel{\text{def}}{=} \frac{\text{FP}}{\text{FP} + \text{TN}}.

ROC 曲线只能用于评估返回一些预测置信度得分(或概率)的分类器。例如,逻辑回归、神经网络和决策树(以及基于决策树的集成模型)可以使用 ROC 曲线进行评估。

ROC curves can only be used to assess classifiers that return some confidence score (or a probability) of prediction. For example, logistic regression, neural networks, and decision trees (and ensemble models based on decision trees) can be assessed using ROC curves.

要绘制 ROC 曲线,首先要将置信度分数的范围离散化。如果模型的分数范围是 [0, 1],那么可以像这样离散化:[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]。然后,将每个离散值用作预测阈值,并使用模型和该阈值预测数据集中示例的标签。例如,若要计算阈值等于 0.7 时的 TPR 和 FPR,就将模型应用于每个示例并得到分数:如果分数高于或等于 0.7,则预测为正类;否则,预测为负类。

To draw a ROC curve, you first discretize the range of the confidence score. If this range for a model is [0, 1], then you can discretize it like this: [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]. Then, you use each discrete value as the prediction threshold and predict the labels of examples in your dataset using the model and this threshold. For example, if you want to compute TPR and FPR for the threshold equal to 0.7, you apply the model to each example, get the score, and, if the score is higher than or equal to 0.7, you predict the positive class; otherwise, you predict the negative class.
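The threshold sweep can be sketched as follows (function name illustrative; labels are assumed to be 0/1, and an example is predicted positive when its score is at least the threshold):

```python
def roc_points(scores, labels, thresholds):
    """For each threshold, predict positive when score >= threshold
    and compute the resulting (FPR, TPR) pair."""
    points = []
    for t in thresholds:
        tp = fp = fn = tn = 0
        for s, y in zip(scores, labels):
            positive = s >= t
            if positive and y == 1:
                tp += 1
            elif positive:
                fp += 1
            elif y == 1:
                fn += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((fpr, tpr))
    return points
```

Plotting these (FPR, TPR) pairs, from threshold 0 up past 1, traces the curve of fig. 21; the area under it is the AUC.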

请看图 21 中的插图。很容易看出,如果阈值是 0,我们所有的预测都是正的,因此 TPR 和 FPR 都是 1(右上角)。另一方面,如果阈值是 1,则不会做出任何正类预测,TPR 和 FPR 都是 0,对应于左下角。

Look at the illustration in fig. 21. It’s easy to see that if the threshold is 0, all our predictions will be positive, so both TPR and FPR will be 1 (the upper right corner). On the other hand, if the threshold is 1, then no positive prediction will be made, and both TPR and FPR will be 0, which corresponds to the lower left corner.

ROC 曲线下面积(AUC)越大,分类器越好。AUC 高于 0.5 的分类器比随机分类器更好。如果 AUC 低于 0.5,那么你的模型有问题。完美分类器的 AUC 为 1。通常,如果模型表现良好,你可以通过选择一个使 TPR 接近 1、同时保持 FPR 接近 0 的阈值来获得良好的分类器。

The higher the area under the ROC curve (AUC), the better the classifier. A classifier with an AUC higher than 0.5 is better than a random classifier. If AUC is lower than 0.5, then something is wrong with your model. A perfect classifier would have an AUC of 1. Usually, if your model behaves well, you obtain a good classifier by selecting the value of the threshold that gives TPR close to 1 while keeping FPR near 0.

ROC 曲线之所以受欢迎,是因为它们相对简单易懂,它们捕获了分类的多个方面(通过考虑误报和漏报),并且可以直观地轻松比较不同模型的性能。

ROC curves are popular because they are relatively simple to understand, they capture more than one aspect of the classification (by taking both false positives and false negatives into account), and they allow comparing the performance of different models visually and with little effort.

5.7超参数调优

5.7 Hyperparameter Tuning

当我介绍学习算法时,我提到作为数据分析师,你必须为算法的超参数选择好的值,例如 ID3 的 \epsilon 和 d、SVM 的 C,或梯度下降的 \alpha。但这究竟是什么意思?哪个值最好,又该如何找到它?在本节中,我将回答这些基本问题。

When I presented learning algorithms, I mentioned that you as a data analyst have to select good values for the algorithm’s hyperparameters, such as \epsilon and d for ID3, C for SVM, or \alpha for gradient descent. But what does that exactly mean? Which value is the best, and how do you find it? In this section, I answer these essential questions.

如您所知,超参数不会由学习算法本身进行优化。数据分析师必须通过实验找到最佳值组合(每个超参数一个)来“调整”超参数。

As you already know, hyperparameters aren’t optimized by the learning algorithm itself. The data analyst has to “tune” hyperparameters by experimentally finding the best combination of values, one per hyperparameter.

当您有足够的数据来拥有一个像样的验证集(其中每个类至少由几十个示例表示)并且超参数的数量及其范围不太大时,一种典型的方法是使用网格搜索

One typical way to do that, when you have enough data for a decent validation set (in which each class is represented by at least a couple of dozen examples) and the number of hyperparameters and their ranges are not too large, is to use grid search.

网格搜索是最简单的超参数调整技术。假设您要训练一个 SVM,并且有两个超参数需要调整:惩罚参数 C(正实数)和核(“linear”或“rbf”)。

Grid search is the simplest hyperparameter tuning technique. Let’s say you train an SVM and you have two hyperparameters to tune: the penalty parameter C (a positive real number) and the kernel (either “linear” or “rbf”).

如果这是您第一次使用这个特定的数据集,您并不知道 C 的可能取值范围。最常见的技巧是使用对数刻度。例如,对于 C,您可以尝试以下值:[0.001, 0.01, 0.1, 1, 10, 100, 1000]。在这种情况下,您有 14 种超参数组合可供尝试:[(0.001, “linear”), (0.01, “linear”), (0.1, “linear”), (1, “linear”), (10, “linear”), (100, “linear”), (1000, “linear”), (0.001, “rbf”), (0.01, “rbf”), (0.1, “rbf”), (1, “rbf”), (10, “rbf”), (100, “rbf”), (1000, “rbf”)]。

If it’s the first time you are working with this particular dataset, you don’t know what is the possible range of values for C. The most common trick is to use a logarithmic scale. For example, for C you can try the following values: [0.001, 0.01, 0.1, 1, 10, 100, 1000]. In this case you have 14 combinations of hyperparameters to try: [(0.001, “linear”), (0.01, “linear”), (0.1, “linear”), (1, “linear”), (10, “linear”), (100, “linear”), (1000, “linear”), (0.001, “rbf”), (0.01, “rbf”), (0.1, “rbf”), (1, “rbf”), (10, “rbf”), (100, “rbf”), (1000, “rbf”)].

您使用训练集并训练 14 个模型,每个模型对应超参数的一种组合。然后,您使用我们在上一节中讨论的指标之一(或对您重要的其他一些指标)评估每个模型在验证数据上的性能。最后,您保留根据指标表现最佳的模型。

You use the training set and train 14 models, one for each combination of hyperparameters. Then you assess the performance of each model on the validation data using one of the metrics we discussed in the previous section (or some other metric that matters to you). Finally, you keep the model that performs the best according to the metric.
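The loop over all 14 combinations can be sketched generically. Here `train_fn` and `score_fn` stand in for whatever training routine and validation metric you actually use; all names are hypothetical:

```python
from itertools import product

def grid_search(train_fn, score_fn, param_grid):
    """Train one model per hyperparameter combination and keep the
    combination with the best validation score (higher is better)."""
    best_score, best_params = None, None
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        model = train_fn(**params)
        score = score_fn(model)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        "kernel": ["linear", "rbf"]}
```

With a real learning library you would pass a function that fits the SVM on the training set and one that computes, say, accuracy on the validation set.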

一旦找到最佳的超参数对,您就可以尝试探索其周围某些区域中接近最佳值的值。有时,这可以产生更好的模型。

Once the best pair of hyperparameters is found, you can try to explore the values close to the best ones in some region around them. Sometimes, this can result in an even better model.

最后,您使用测试集评估所选模型。

Finally, you assess the selected model using the test set.

您可能已经注意到,尝试所有超参数组合(尤其是当超参数不止两三个时)可能非常耗时,对于大型数据集尤其如此。还有更高效的技术,例如随机搜索和贝叶斯超参数优化。

As you might have noticed, trying all combinations of hyperparameters, especially when there are more than a couple of them, can be time-consuming for large datasets. There are more efficient techniques, such as random search and Bayesian hyperparameter optimization.

随机搜索与网格搜索的不同之处在于,您不再提供一组离散的值来探索每个超参数;相反,您为每个超参数提供一个统计分布,从中随机采样值并设置要尝试的组合总数。

Random search differs from grid search in that you no longer provide a discrete set of values to explore for each hyperparameter; instead, you provide a statistical distribution for each hyperparameter from which values are randomly sampled and set the total number of combinations you want to try.

贝叶斯技术与随机或网格搜索的不同之处在于,贝叶斯技术使用过去的评估结果来选择下一个要评估的值。这个想法是通过基于过去表现良好的超参数值来选择下一个超参数值,从而限制目标函数昂贵的优化次数。

Bayesian techniques differ from random or grid search in that they use past evaluation results to choose the next values to evaluate. The idea is to limit the number of expensive optimizations of the objective function by choosing the next hyperparameter values based on those that have done well in the past.

还有基于梯度的技术进化优化技术和其他算法超参数调整技术。大多数现代机器学习库都实现了一种或多种此类技术。还有超参数调整库,可以帮助您调整几乎任何学习算法的超参数,包括您自己编程的算法。

There are also gradient-based techniques, evolutionary optimization techniques, and other algorithmic hyperparameter tuning techniques. Most modern machine learning libraries implement one or more such techniques. There are also hyperparameter tuning libraries that can help you to tune hyperparameters of virtually any learning algorithm, including ones you programmed yourself.

5.7.1交叉验证

5.7.1 Cross-Validation

当您没有合适的验证集来调整超参数时,可以帮助您的常用技术称为交叉验证。当训练样本很少时,同时拥有验证集和测试集可能会令人望而却步。您更愿意使用更多数据来训练模型。在这种情况下,您只需将数据分为训练集和测试集。然后,您对训练集使用交叉验证来模拟验证集。

When you don’t have a decent validation set to tune your hyperparameters on, the common technique that can help you is called cross-validation. When you have few training examples, it could be prohibitive to have both validation and test set. You would prefer to use more data to train the model. In such a case, you only split your data into a training and a test set. Then you use cross-validation on the training set to simulate a validation set.

交叉验证的工作原理如下。首先,固定要评估的超参数的值。然后将训练集分成几个大小相同的子集,每个子集称为一折。实践中通常使用五折交叉验证:将训练数据随机分为五折 \{F_1, F_2, \ldots, F_5\},每个 F_k(k=1,\ldots,5)包含 20% 的训练数据。然后按如下方式训练五个模型。为了训练第一个模型 f_1,使用折 F_2、F_3、F_4 和 F_5 中的所有示例作为训练集,F_1 中的示例作为验证集。为了训练第二个模型 f_2,使用折 F_1、F_3、F_4 和 F_5 中的示例进行训练,F_2 中的示例作为验证集。像这样迭代地继续构建模型,并在从 F_1 到 F_5 的每个验证集上计算感兴趣指标的值。然后,对指标的五个值取平均得到最终值。

Cross-validation works as follows. First, you fix the values of the hyperparameters you want to evaluate. Then you split your training set into several subsets of the same size. Each subset is called a fold. Typically, five-fold cross-validation is used in practice. With five-fold cross-validation, you randomly split your training data into five folds: \{F_1, F_2, \ldots, F_5\}. Each F_k, k=1,\ldots,5, contains 20% of your training data. Then you train five models as follows. To train the first model, f_1, you use all examples from folds F_2, F_3, F_4, and F_5 as the training set and the examples from F_1 as the validation set. To train the second model, f_2, you use the examples from folds F_1, F_3, F_4, and F_5 to train and the examples from F_2 as the validation set. You continue building models iteratively like this and compute the value of the metric of interest on each validation set, from F_1 to F_5. Then you average the five values of the metric to get the final value.
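The fold rotation can be sketched like this (names hypothetical; the sketch assumes the examples were already shuffled, and `train_fn`/`score_fn` stand in for your actual training routine and metric):

```python
def cross_validate(examples, k, train_fn, score_fn):
    """Hold out each of the k folds in turn, train on the other k-1
    folds, score on the held-out fold, and average the k scores."""
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        validation = folds[i]
        training = [x for j in range(k) if j != i for x in folds[j]]
        model = train_fn(training)
        scores.append(score_fn(model, validation))
    return sum(scores) / k
```

Calling it with `k=5` gives the five-fold procedure described above.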

您可以使用网格搜索和交叉验证来查找模型的最佳超参数值。找到这些值后,您可以使用整个训练集来构建模型,其中包含通过交叉验证找到的超参数的最佳值。最后,您使用测试集评估模型。

You can use grid search with cross-validation to find the best values of hyperparameters for your model. Once you have found these values, you use the entire training set to build the model with these best values of hyperparameters you have found via cross-validation. Finally, you assess the model using the test set.


  1. 当某个分类变量的取值顺序很重要时,我们可以只保留一个变量,用数字替换这些取值。例如,如果变量代表文章的质量,取值为 {poor, decent, good, excellent},那么我们可以用数字替换这些类别,例如 {1, 2, 3, 4}。

  1. When the ordering of values of some categorical variable matters, we can replace those values by numbers by keeping only one variable. For example, if our variable represents the quality of an article, and the values are {poor, decent, good, excellent}, then we could replace those categories by numbers, for example, {1, 2, 3, 4}.

  2. 现代库中实现的 kNN 和集成方法的预测速度仍然相当快。不要害怕在实践中使用这些算法。

  2. The prediction speed of kNN and ensemble methods implemented in modern libraries is still pretty fast. Don’t be afraid of using these algorithms in your practice.

  3. 或任何其他类型的有意义的平均损失函数。

  3. Or any other type of average loss function that makes sense.

6神经网络和深度学习

6 Neural Networks and Deep Learning

首先,你已经知道什么是神经网络,并且你已经知道如何构建这样的模型。是的,这就是逻辑回归!事实上,逻辑回归模型,或者更确切地说,它对多类分类的推广,称为 softmax 回归模型,是神经网络中的标准单元。

First of all, you already know what a neural network is, and you already know how to build such a model. Yes, it’s logistic regression! As a matter of fact, the logistic regression model, or rather its generalization for multiclass classification, called the softmax regression model, is a standard unit in a neural network.

6.1神经网络

6.1 Neural Networks

如果您了解线性回归、逻辑回归和梯度下降,那么理解神经网络应该不成问题。

If you understood linear regression, logistic regression, and gradient descent, understanding neural networks should not be a problem.

神经网络 (NN),就像回归或 SVM 模型一样,是一个数学函数:

A neural network (NN), just like a regression or an SVM model, is a mathematical function:


y=fNN(𝐱). y = f_{NN}(\mathbf{x}).

函数 f_{NN} 有一种特殊的形式:它是一个嵌套函数。您可能已经听说过神经网络层。因此,对于返回标量的 3 层神经网络,f_{NN} 看起来像这样:

The function fNNf_{NN} has a particular form: it’s a nested function. You have probably already heard of neural network layers. So, for a 3-layer neural network that returns a scalar, fNNf_{NN} looks like this:


y=fNN(𝐱)=f3(𝐟2(𝐟1(𝐱))).y = f_{NN}(\mathbf{x}) = f_3(\bm{f}_2(\bm{f}_1(\mathbf{x}))).

在上式中,𝐟1\bm{f}_1𝐟2\bm{f}_2是以下形式的向量函数:

In the above equation, 𝐟1\bm{f}_1 and 𝐟2\bm{f}_2 are vector functions of the following form:


𝐟l(𝐳)=def𝐠l(𝐖l𝐳+𝐛l),(20) \bm{f}_l(\mathbf{z}) \stackrel{\text{def}}{=} \bm{g}_l(\mathbf{W}_l\mathbf{z} + \mathbf{b}_l), \qquad(20)

其中 l 称为层索引,可以从 1 到任意层数。函数 \bm{g}_l 称为激活函数。它是数据分析师在学习开始之前选择的固定的、通常是非线性的函数。每一层的参数 \mathbf{W}_l(矩阵)和 \mathbf{b}_l(向量)通过根据任务优化特定的成本函数(例如 MSE),使用熟悉的梯度下降来学习。将式 (20) 与逻辑回归方程进行比较,并将其中的 \bm{g}_l 替换为 sigmoid 函数,你不会看到任何差异。函数 f_3 对于回归任务是标量函数,但根据您的问题也可以是向量函数。

where ll is called the layer index and can span from 11 to any number of layers. The function 𝐠l\bm{g}_l is called an activation function. It is a fixed, usually nonlinear function chosen by the data analyst before the learning is started. The parameters 𝐖l\mathbf{W}_l (a matrix) and 𝐛l\mathbf{b}_l (a vector) for each layer are learned using the familiar gradient descent by optimizing, depending on the task, a particular cost function (such as MSE). Compare eq. 20 with the equation for logistic regression, where you replace 𝐠l\bm{g}_l by the sigmoid function, and you will not see any difference. The function f3f_3 is a scalar function for the regression task, but can also be a vector function depending on your problem.

您可能想知道为什么使用矩阵 \mathbf{W}_l 而不是向量 \mathbf{w}_l。原因是 \bm{g}_l 是一个向量函数。矩阵 \mathbf{W}_l 的每一行 \mathbf{w}_{l,u}(u 表示单元)都是与 \mathbf{z} 维数相同的向量。令 a_{l,u} = \mathbf{w}_{l,u}\mathbf{z} + b_{l,u}。\bm{f}_l(\mathbf{z}) 的输出是一个向量 [g_l(a_{l,1}), g_l(a_{l,2}),\ldots, g_l(a_{l,size_l})],其中 g_l 是某个标量函数1,size_l 是层 l 中的单元数。为了更具体,让我们考虑一种称为多层感知器(通常称为普通神经网络)的神经网络架构。

You may probably wonder why a matrix 𝐖l\mathbf{W}_l is used and not a vector 𝐰l\mathbf{w}_l. The reason is that 𝐠l\bm{g}_l is a vector function. Each row 𝐰l,u\mathbf{w}_{l,u} (uu for unit) of the matrix 𝐖l\mathbf{W}_l is a vector of the same dimensionality as 𝐳\mathbf{z}. Let al,u=𝐰l,u𝐳+bl,ua_{l,u} = \mathbf{w}_{l,u}\mathbf{z} + b_{l,u}. The output of 𝐟l(𝐳)\bm{f}_l(\mathbf{z}) is a vector [gl(al,1),gl(al,2),,gl(al,sizel)][g_l(a_{l,1}), g_l(a_{l,2}),\ldots, g_l(a_{l,size_l})], where glg_l is some scalar function1, and sizelsize_l is the number of units in layer ll. To make it more concrete, let’s consider one architecture of neural networks called multilayer perceptron and often referred to as a vanilla neural network.
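A minimal NumPy sketch of this nested structure (the layer sizes follow fig. 22; the ReLU hidden activations and random weights are my illustrative choices, not the book's):

```python
import numpy as np

def dense_layer(W, b, g):
    # f_l(z) = g_l(W_l z + b_l); each row of W is one unit's weight vector w_{l,u}.
    return lambda z: g(W @ z + b)

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(0)
D = 2                      # input dimensionality
sizes = [4, 4, 1]          # units per layer, as in fig. 22
Ws = [rng.normal(size=(s_out, s_in))
      for s_in, s_out in zip([D] + sizes[:-1], sizes)]
bs = [np.zeros(s) for s in sizes]

def f_nn(x):
    # y = f_3(f_2(f_1(x))): a nested function
    f1 = dense_layer(Ws[0], bs[0], relu)
    f2 = dense_layer(Ws[1], bs[1], relu)
    f3 = dense_layer(Ws[2], bs[2], lambda z: z)  # linear output unit: regression
    return f3(f2(f1(x)))
```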

6.1.1多层感知器示例

6.1.1 Multilayer Perceptron Example

我们仔细研究了一种称为前馈神经网络(FFNN) 的特定神经网络配置,更具体地说是称为多层感知器(MLP)的架构。作为说明,让我们考虑一个具有三层的 MLP。我们的网络采用二维特征向量作为输入并输出一个数字。该 FFNN 可以是回归模型或分类模型,具体取决于第三输出层中使用的激活函数。

Let's have a closer look at one particular configuration of neural networks called feed-forward neural networks (FFNN), and, more specifically, at the architecture called a multilayer perceptron (MLP). As an illustration, let's consider an MLP with three layers. Our network takes a two-dimensional feature vector as input and outputs a number. This FFNN can be a regression or a classification model, depending on the activation function used in the third, output layer.

我们的 MLP 如下所示。

Our MLP is depicted below.

图 22:具有二维输入的多层感知器,两层有四个单元,一个输出层有一个单元。
图 22:具有二维输入的多层感知器,两层有四个单元,一个输出层有一个单元。

神经网络以图形方式表示为逻辑上组织为一层或多层的单元连接组合。每个单元由圆形或矩形表示。入站箭头表示单元的输入并指示该输入的来源。出站箭头表示单元的输出。

The neural network is represented graphically as a connected combination of units logically organized into one or more layers. Each unit is represented by either a circle or a rectangle. An inbound arrow represents an input of a unit and indicates where this input came from. An outbound arrow indicates the output of a unit.

每个单元的输出是写在矩形内的数学运算的结果。圆形单位不会对输入执行任何操作;他们只是将输入直接发送到输出。

The output of each unit is the result of the mathematical operation written inside the rectangle. Circle units don’t do anything with the input; they just send their input directly to the output.

每个矩形单元中都会发生以下情况。首先,将单元的所有输入连接在一起以形成输入向量。然后,该单元对输入向量应用线性变换,就像线性回归模型对其输入特征向量所做的那样。最后,该单元应用激活函数Gg线性变换的结果并获得输出值,一个实数。在普通 FFNN 中,某层单元的输出值成为后续层每个单元的输入值。

The following happens in each rectangle unit. Firstly, all inputs of the unit are joined together to form an input vector. Then the unit applies a linear transformation to the input vector, exactly like a linear regression model does with its input feature vector. Finally, the unit applies an activation function gg to the result of the linear transformation and obtains the output value, a real number. In a vanilla FFNN, the output value of a unit of some layer becomes an input value of each of the units of the subsequent layer.

在图 22 中,激活函数 g_l 有一个索引:l,即单元所属层的索引。通常,一层的所有单元都使用相同的激活函数,但这不是规则。每层可以有不同数量的单元。每个单元都有自己的参数 \mathbf{w}_{l,u} 和 b_{l,u},其中 u 是单元的索引,l 是层的索引。每个单元中的向量 \mathbf{y}_{l-1} 定义为 [y^{(1)}_{l-1},y^{(2)}_{l-1},y^{(3)}_{l-1},y^{(4)}_{l-1}]。第一层中的向量 \mathbf{x} 定义为 [x^{(1)},\ldots,x^{(D)}]。

In fig. 22, the activation function glg_l has one index: ll, the index of the layer the unit belongs to. Usually, all units of a layer use the same activation function, but it’s not a rule. Each layer can have a different number of units. Each unit has its parameters 𝐰l,u\mathbf{w}_{l,u} and bl,ub_{l,u}, where uu is the index of the unit, and ll is the index of the layer. The vector 𝐲l1\mathbf{y}_{l-1} in each unit is defined as [yl1(1),yl1(2),yl1(3),yl1(4)][y^{(1)}_{l-1},y^{(2)}_{l-1},y^{(3)}_{l-1},y^{(4)}_{l-1}]. The vector 𝐱\mathbf{x} in the first layer is defined as [x(1),,x(D)][x^{(1)},\ldots,x^{(D)}].

如图 22 所示,在多层感知器中,一层的所有输出都连接到后续层的每个输入。这种架构称为全连接。神经网络可以包含全连接层。这些层的单元接收前一层每个单元的输出作为输入。

As you can see in fig. 22, in a multilayer perceptron all outputs of one layer are connected to each input of the succeeding layer. This architecture is called fully-connected. A neural network can contain fully-connected layers. Those are the layers whose units receive as inputs the outputs of each of the units of the previous layer.

6.1.2前馈神经网络架构

6.1.2 Feed-Forward Neural Network Architecture

如果我们想要解决前面章节中讨论的回归或分类问题,神经网络的最后(最右边)一层通常只包含一个单元。如果最后一个单元的激活函数 g_{last} 是线性的,则该神经网络是回归模型。如果 g_{last} 是逻辑函数,则该神经网络是二元分类模型。

If we want to solve a regression or a classification problem discussed in previous chapters, the last (the rightmost) layer of a neural network usually contains only one unit. If the activation function glastg_{last} of the last unit is linear, then the neural network is a regression model. If the glastg_{last} is a logistic function, the neural network is a binary classification model.

数据分析师可以选择任何数学函数作为 g_{l,u},只要它是可微的2。后一个属性对于用梯度下降求出所有 l 和 u 的参数值 \mathbf{w}_{l,u} 和 b_{l,u} 至关重要。函数 f_{NN} 中具有非线性分量的主要目的,是让神经网络能够逼近非线性函数。没有非线性,无论有多少层,f_{NN} 都是线性的。原因是 \mathbf{W}_l\mathbf{z} + \mathbf{b}_l 是线性函数,而线性函数的线性函数也是线性的。

The data analyst can choose any mathematical function as gl,ug_{l,u}, assuming it’s differentiable2. The latter property is essential for gradient descent used to find the values of the parameters 𝐰l,u\mathbf{w}_{l,u} and bl,ub_{l,u} for all ll and uu. The primary purpose of having nonlinear components in the function fNNf_{NN} is to allow the neural network to approximate nonlinear functions. Without nonlinearities, fNNf_{NN} would be linear, no matter how many layers it has. The reason is that 𝐖l𝐳+𝐛l\mathbf{W}_l\mathbf{z} + \mathbf{b}_l is a linear function and a linear function of a linear function is also linear.

激活函数的流行选择是您已经知道的逻辑函数,以及 TanH 和 ReLU。前者是双曲正切函数,类似于逻辑函数,但取值范围从 -1 到 1(不会到达它们)。后者是修正线性单元函数,当其输入 z 为负数时等于零,否则等于 z:

Popular choices of activation functions are the logistic function, already known to you, as well as TanH and ReLU. The former is the hyperbolic tangent function, similar to the logistic function but ranging from 1-1 to 11 (without reaching them). The latter is the rectified linear unit function, which equals to zero when its input zz is negative and to zz otherwise:


tanh(z)=ezezez+ez, tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, relu(z)={0if z<0zotherwise. relu(z) = \begin{cases}0&{\mbox{if }}z<0\\z&{\mbox{otherwise}}\end{cases}.
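Both functions are one-liners; the sketch below just transcribes the definitions above:

```python
import math

def tanh(z):
    # Hyperbolic tangent: (e^z - e^-z) / (e^z + e^-z), ranging over (-1, 1).
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return 0.0 if z < 0 else z
```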

正如我上面所说,表达式 \mathbf{W}_l\mathbf{z} + \mathbf{b}_l 中的 \mathbf{W}_l 是一个矩阵,而 \mathbf{b}_l 是一个向量。这看起来与线性回归的 \mathbf{w}\mathbf{z} + b 不同。在矩阵 \mathbf{W}_l 中,每一行 u 对应一个参数向量 \mathbf{w}_{l,u}。向量 \mathbf{w}_{l,u} 的维数等于层 l-1 中的单元数。运算 \mathbf{W}_l\mathbf{z} 的结果是向量 \mathbf{a}_l \stackrel{\text{def}}{=} [\mathbf{w}_{l,1}\mathbf{z},\mathbf{w}_{l,2}\mathbf{z}, \ldots, \mathbf{w}_{l,size_l}\mathbf{z}]。然后求和 \mathbf{a}_l + \mathbf{b}_l 给出一个 size_l 维向量 \mathbf{c}_l。最后,函数 \bm{g}_l(\mathbf{c}_l) 产生向量 \mathbf{y}_l \stackrel{\text{def}}{=} [y_l^{(1)}, y_l^{(2)}, \ldots, y_l^{(size_l)}] 作为输出。

As I said above, 𝐖l\mathbf{W}_l in the expression 𝐖l𝐳+𝐛l\mathbf{W}_l\mathbf{z} + \mathbf{b}_l, is a matrix, while 𝐛l\mathbf{b}_l is a vector. That looks different from linear regression’s 𝐰𝐳+b\mathbf{w}\mathbf{z} + b. In matrix 𝐖l\mathbf{W}_l, each row uu corresponds to a vector of parameters 𝐰l,u\mathbf{w}_{l,u}. The dimensionality of the vector 𝐰l,u\mathbf{w}_{l,u} equals to the number of units in the layer l1l-1. The operation 𝐖l𝐳\mathbf{W}_l\mathbf{z} results in a vector 𝐚l=def[𝐰l,1𝐳,𝐰l,2𝐳,,𝐰l,sizel𝐳]\mathbf{a}_l \stackrel{\text{def}}{=} [\mathbf{w}_{l,1}\mathbf{z},\mathbf{w}_{l,2}\mathbf{z}, \ldots, \mathbf{w}_{l,size_l}\mathbf{z}]. Then the sum 𝐚l+𝐛l\mathbf{a}_l + \mathbf{b}_l gives a sizelsize_l-dimensional vector 𝐜l\mathbf{c}_l. Finally, the function 𝐠l(𝐜l)\bm{g}_l(\mathbf{c}_l) produces the vector 𝐲l=def[yl(1),yl(2),,yl(sizel)]\mathbf{y}_l \stackrel{\text{def}}{=} [y_l^{(1)}, y_l^{(2)}, \ldots, y_l^{(size_l)}] as output.

6.2深度学习

6.2 Deep Learning

深度学习是指训练具有两个以上非输出层的神经网络。过去,随着层数的增加,训练此类网络变得更加困难。使用梯度下降来训练网络参数时,两个最大的挑战被称为梯度爆炸梯度消失问题。

Deep learning refers to training neural networks with more than two non-output layers. In the past, it became more difficult to train such networks as the number of layers grew. The two biggest challenges were referred to as the problems of exploding gradient and vanishing gradient as gradient descent was used to train the network parameters.

虽然通过应用梯度裁剪和 L1 或 L2 正则化等简单技术可以更轻松地处理梯度爆炸问题,但梯度消失问题几十年来仍然难以解决。

While the problem of exploding gradient was easier to deal with by applying simple techniques like gradient clipping and L1 or L2 regularization, the problem of vanishing gradient remained intractable for decades.

什么是梯度消失以及为什么会出现梯度消失?为了更新神经网络中的参数值,通常使用称为反向传播的算法。反向传播是一种使用链式法则计算神经网络梯度的有效算法。在第 4 章中,我们已经了解了如何使用链式法则来计算复函数的偏导数。在梯度下降期间,神经网络的参数接收与每次训练迭代中成本函数相对于当前参数的偏导数成比例的更新。问题在于,在某些情况下,梯度会非常小,从而有效地阻止了某些参数改变其值。在最坏的情况下,这可能会完全阻止神经网络进一步训练。

What is vanishing gradient and why does it arise? To update the values of the parameters in neural networks the algorithm called backpropagation is typically used. Backpropagation is an efficient algorithm for computing gradients on neural networks using the chain rule. In Chapter 4, we have already seen how the chain rule is used to calculate partial derivatives of a complex function. During gradient descent, the neural network’s parameters receive an update proportional to the partial derivative of the cost function with respect to the current parameter in each iteration of training. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing some parameters from changing their value. In the worst case, this may completely stop the neural network from further training.

传统的激活函数,例如我上面提到的双曲正切函数,其梯度范围为 (0,1),而反向传播通过链式法则计算梯度。这会产生将 n 个这样的小数字相乘来计算 n 层网络中较早(最左边)层梯度的效果,这意味着梯度随 n 呈指数下降。这会导致较早的层训练非常缓慢(如果还能训练的话)。

Traditional activation functions, such as the hyperbolic tangent function I mentioned above, have gradients in the range (0,1)(0,1), and backpropagation computes gradients by the chain rule. That has the effect of multiplying nn of these small numbers to compute gradients of the earlier (leftmost) layers in an nn-layer network, meaning that the gradient decreases exponentially with nn. That results in the effect that the earlier layers train very slowly, if at all.
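A toy computation makes the exponential decay concrete; the constant 0.25 is an assumed stand-in for a typical small per-layer derivative factor (it is the maximum of the sigmoid's derivative), not a value from the text:

```python
def gradient_at_first_layer(n_layers, per_layer_factor=0.25):
    # The gradient reaching the first layer is (roughly) a product of n
    # per-layer derivative factors; with factors below 1 it vanishes fast.
    grad = 1.0
    for _ in range(n_layers):
        grad *= per_layer_factor
    return grad
```

With 10 layers the factor is already below one in a million, which is why the earliest layers of a deep network barely move under gradient descent.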

然而,神经网络学习算法的现代实现允许您有效地训练非常深的神经网络(最多数百层)。这是由于多项改进的结合,包括 ReLU、LSTM(以及其他门控单元;我们在下面考虑它们),以及残差神经网络中使用的跳跃连接等技术,以及梯度下降算法的高级修改。

However, the modern implementations of neural network learning algorithms allow you to effectively train very deep neural networks (up to hundreds of layers). This is due to several improvements combined, including ReLU, LSTM (and other gated units; we consider them below), techniques such as skip connections used in residual neural networks, and advanced modifications of the gradient descent algorithm.

因此,今天,由于梯度消失和爆炸的问题在很大程度上已得到解决(或其影响已减弱),“深度学习”一词指的是使用现代算法和数学工具包来训练神经网络,而与神经网络的深度无关。在实践中,许多业务问题可以通过在输入层和输出层之间有 2-3 层的神经网络来解决。既非输入也非输出的层通常称为隐藏层

Therefore, today, since the problems of vanishing and exploding gradient are mostly solved (or their effect diminished) to a great extent, the term “deep learning” refers to training neural networks using the modern algorithmic and mathematical toolkit independently of how deep the neural network is. In practice, many business problems can be solved with neural networks having 2-3 layers between the input and output layers. The layers that are neither input nor output are often called hidden layers.

6.2.1卷积神经网络

6.2.1 Convolutional Neural Network

您可能已经注意到,随着网络规模的扩大,MLP 的参数数量会增长得非常快。更具体地说,每添加一层,您就会添加 (size_{l-1} + 1)\cdot size_{l} 个参数(我们的矩阵 \mathbf{W}_l 加上向量 \mathbf{b}_l)。这意味着,如果您向现有神经网络再添加一个 1000 单元的层,您就会向模型添加超过 100 万个额外参数。优化如此大的模型是一个计算量非常大的问题。

You may have noticed that the number of parameters an MLP can have grows very fast as you make your network bigger. More specifically, as you add one layer, you add (sizel1+1)sizel(size_{l-1} + 1)\cdot size_{l} parameters (our matrix 𝐖l\mathbf{W}_l plus the vector 𝐛l\mathbf{b}_l). That means that if you add another 1000-unit layer to an existing neural network, then you add more than 1 million additional parameters to your model. Optimizing such big models is a very computationally intensive problem.
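As a quick check of this arithmetic, a helper computing the per-layer parameter count (the million-parameter figure assumes the preceding layer also has 1000 units):

```python
def layer_params(size_prev, size_l):
    # W_l contributes size_l rows of size_prev weights; b_l adds size_l biases,
    # giving (size_prev + 1) * size_l parameters for the layer.
    return (size_prev + 1) * size_l
```

For example, a 1000-unit layer after a 1000-unit layer adds 1,001,000 parameters; the first hidden layer of fig. 22 (2 inputs, 4 units) has only 12.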

当我们的训练样本是图像时,输入的维度非常高3。如果你想学习使用 MLP 对图像进行分类,优化问题可能会变得棘手。

When our training examples are images, the input is very high-dimensional3. If you want to learn to classify images using an MLP, the optimization problem is likely to become intractable.

卷积神经网络(CNN)是一种特殊的 FFNN,它显着减少了具有许多单元的深度神经网络中的参数数量,而不会损失太多模型的质量。 CNN 已在图像和文本处理领域得到应用,并且超越了许多先前建立的基准。

A convolutional neural network (CNN) is a special kind of FFNN that significantly reduces the number of parameters in a deep neural network with many units without losing too much in the quality of the model. CNNs have found applications in image and text processing where they beat many previously established benchmarks.

由于 CNN 是在考虑图像处理的情况下发明的,因此我在图像分类示例中对其进行了解释。

Because CNNs were invented with image processing in mind, I explain them using the image classification example.

您可能已经注意到,在图像中,彼此接近的像素通常代表相同类型的信息:天空、水、树叶、毛皮、砖块等。该规则的例外是边缘:图像中两个不同对象相互“接触”的部分。

You may have noticed that in images, pixels that are close to one another usually represent the same type of information: sky, water, leaves, fur, bricks, and so on. The exception to this rule is the edges: the parts of an image where two different objects “touch” one another.

如果我们可以训练神经网络来识别具有相同信息的区域以及边缘,那么这些知识将允许神经网络预测图像中表示的对象。例如,如果神经网络检测到多个皮肤区域和边缘看起来像椭圆形的一部分,内部色调类似皮肤,外部色调偏蓝色,那么它很可能是天空背景上的一张脸。如果我们的目标是检测图片上的人,神经网络很可能会成功预测图片中的人。

If we can train the neural network to recognize regions of the same information as well as the edges, this knowledge would allow the neural network to predict the object represented in the image. For example, if the neural network detected multiple skin regions and edges that look like parts of an oval with skin-like tone on the inside and bluish tone on the outside, then it is likely that it’s a face on a sky background. If our goal is to detect people in pictures, the neural network will most likely succeed in predicting a person in this picture.

考虑到图像中最重要的信息是局部的,我们可以使用移动窗口方法将图像分割成方形块4。然后,我们可以一次训练多个较小的回归模型,每个小回归模型接收一个正方形补丁作为输入。每个小型回归模型的目标是学习检测输入补丁中的特定类型模式。例如,一个小型回归模型将学习检测天空;另一个将检测草地,第三个将检测建筑物的边缘,依此类推。

Having in mind that the most important information in the image is local, we can split the image into square patches using a moving window approach4. We can then train multiple smaller regression models at once, each small regression model receiving a square patch as input. The goal of each small regression model is to learn to detect a specific kind of pattern in the input patch. For example, one small regression model will learn to detect the sky; another one will detect the grass, the third one will detect edges of a building, and so on.

在 CNN 中,小型回归模型看起来像图 22 中的模型,但它只有第 1 层,没有第 2 层和第 3 层。为了检测某种模式,小型回归模型必须学习大小为 p \times p 的矩阵 \mathbf{F}(“过滤器”)的参数,其中 p 是补丁的大小。为简单起见,我们假设输入图像是黑白的,1 代表黑色像素,0 代表白色像素。还假设我们的补丁是 3×3 像素(p = 3)。那么某个补丁可能类似于以下矩阵 \mathbf{P}(“补丁”):

In CNNs, a small regression model looks like the one in fig. 22, but it only has the layer 11 and doesn’t have layers 22 and 33. To detect some pattern, a small regression model has to learn the parameters of a matrix 𝐅\mathbf{F} (for “filter”) of size p×pp \times p, where pp is the size of a patch. Let’s assume, for simplicity, that the input image is black and white, with 11 representing black and 00 representing white pixels. Assume also that our patches are 33 by 33 pixels (p=3p = 3). Some patch could then look like the following matrix 𝐏\mathbf{P} (for “patch”):


𝐏=[010111010]. \mathbf{P} = \left[ {\begin{array}{ccc} 0 & 1 & 0\\ 1 & 1 & 1\\ 0 & 1 & 0\\ \end{array} } \right].

上面的补丁代表一个看起来像十字的图案。检测此类模式(并且仅检测它们)的小型回归模型需要学习一个 3×3 参数矩阵 \mathbf{F},其中对应输入补丁中 1 的位置的参数为正数,而对应 0 的位置的参数接近于零。如果我们计算矩阵 \mathbf{P} 和 \mathbf{F} 的卷积,\mathbf{F} 与 \mathbf{P} 越相似,我们获得的值就越高。为了说明两个矩阵的卷积,假设 \mathbf{F} 看起来像这样:\mathbf{F} = \left[ {\begin{array}{ccc} 0 & 2 & 3\\ 2 & 4 & 1\\ 0 & 3 & 0\\ \end{array} } \right].

The above patch represents a pattern that looks like a cross. The small regression model that will detect such patterns (and only them) would need to learn a 33 by 33 parameter matrix 𝐅\mathbf{F} where parameters at positions corresponding to the 11s in the input patch would be positive numbers, while the parameters in positions corresponding to 00s would be close to zero. If we calculate the convolution of matrices 𝐏\mathbf{P} and 𝐅\mathbf{F}, the value we obtain is higher the more similar 𝐅\mathbf{F} is to 𝐏\mathbf{P}. To illustrate the convolution of two matrices, assume that 𝐅\mathbf{F} looks like this: 𝐅=[023241030]. \mathbf{F} = \left[ {\begin{array}{ccc} 0 & 2 & 3\\ 2 & 4 & 1\\ 0 & 3 & 0\\ \end{array} } \right].

卷积运算符仅针对行数和列数相同的矩阵定义。对于我们的矩阵 \mathbf{P} 和 \mathbf{F},其计算方法如下图所示:

The convolution\operatorname{convolution} operator is only defined for matrices that have the same number of rows and columns. For our matrices 𝐏\mathbf{P} and 𝐅\mathbf{F}, it is calculated as illustrated below:

图 23:两个矩阵之间的卷积。
图 23:两个矩阵之间的卷积。

如果我们的输入补丁 \mathbf{P} 有不同的图案,例如字母 L 的图案,\mathbf{P} = \left[ {\begin{array}{ccc} 1 & 0 & 0\\ 1 & 0 & 0\\ 1 & 1 & 1\\ \end{array} } \right],

If our input patch 𝐏\mathbf{P} had a different pattern, for example, that of a letter L, 𝐏=[100100111], \mathbf{P} = \left[ {\begin{array}{ccc} 1 & 0 & 0\\ 1 & 0 & 0\\ 1 & 1 & 1\\ \end{array} } \right],

那么与 \mathbf{F} 的卷积会给出较低的结果:5。因此,您可以看到补丁“看起来”越像过滤器,卷积运算的值就越高。为了方便起见,每个过滤器 \mathbf{F} 还关联一个偏差参数 b,它在应用非线性(激活函数)之前被加到卷积结果中。

then the convolution with 𝐅\mathbf{F} would give a lower result: 55. So, you can see the more the patch “looks” like the filter, the higher the value of the convolution operation is. For convenience, there’s also a bias parameter bb associated with each filter 𝐅\mathbf{F} which is added to the result of a convolution before applying the nonlinearity (activation function).
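For same-size matrices this convolution is just the sum of elementwise products; the sketch below reproduces the cross and letter-L patches from above:

```python
def convolve(P, F):
    # Sum of elementwise products of two same-size matrices.
    assert len(P) == len(F) and all(len(pr) == len(fr) for pr, fr in zip(P, F))
    return sum(p * f for pr, fr in zip(P, F) for p, f in zip(pr, fr))

F = [[0, 2, 3],
     [2, 4, 1],
     [0, 3, 0]]
cross = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]
L_patch = [[1, 0, 0],
           [1, 0, 0],
           [1, 1, 1]]
```

Here `convolve(cross, F)` gives 12, while `convolve(L_patch, F)` gives the lower value 5 quoted above: the closer the patch "looks" like the filter's pattern, the higher the value.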

CNN 的一层由多个卷积滤波器(每个滤波器都有自己的偏差参数)组成,就像普通 FFNN 中的一层由多个单元组成一样。第一(最左边)层的每个过滤器在输入图像上从左到右、从上到下滑动(或卷积),并且在每次迭代时计算卷积。

One layer of a CNN consists of multiple convolution filters (each with its own bias parameter), just like one layer in a vanilla FFNN consists of multiple units. Each filter of the first (leftmost) layer slides — or convolves — across the input image, left to right, top to bottom, and convolution is computed at each iteration.

图 24 给出了该过程的说明,其中显示了一个滤波器在图像上进行卷积的 6 个步骤。

An illustration of the process is given in fig. 24 where 6 steps of one filter convolving across an image are shown.

图 24:在图像上进行卷积的滤波器。
图 24:在图像上进行卷积的滤波器。

滤波器矩阵(每层中的每个滤波器一个)和偏差值是可训练参数,可使用梯度下降和反向传播进行优化。

The filter matrix (one for each filter in each layer) and bias values are trainable parameters that are optimized using gradient descent with backpropagation.

非线性应用于卷积和偏置项的和。通常,ReLU 激活函数用于所有隐藏层。输出层的激活函数取决于任务。

A nonlinearity is applied to the sum of the convolution and the bias term. Typically, the ReLU activation function is used in all hidden layers. The activation function of the output layer depends on the task.

既然每层 l 可以有 size_l 个过滤器,卷积层 l 的输出将包括 size_l 个矩阵,每个过滤器一个。

Since we can have sizelsize_l filters in each layer ll, the output of the convolution layer ll would consist of sizelsize_l matrices, one for each filter.

如果 CNN 的一个卷积层后面跟着另一个卷积层,则后续层 l+1 将前一层 l 的输出视为由 size_l 个图像矩阵组成的集合。这样的集合称为体积。该集合的大小称为体积的深度。层 l+1 的每个过滤器对整个体积进行卷积。体积的一个补丁的卷积,就是该体积所包含的各个矩阵的相应补丁的卷积之和。

If the CNN has one convolution layer following another convolution layer, then the subsequent layer l+1l+1 treats the output of the preceding layer ll as a collection of sizelsize_l image matrices. Such a collection is called a volume. The size of that collection is called the volume’s depth. Each filter of layer l+1l+1 convolves the whole volume. The convolution of a patch of a volume is simply the sum of convolutions of the corresponding patches of individual matrices the volume consists of.

下面,您可以看到深度为 3 的体积的一个补丁的卷积示例。

Below, you can see an example of a convolution of a patch of a volume of depth 33.

图 25:由三个矩阵组成的体积的卷积。
图 25:由三个矩阵组成的体积的卷积。

卷积的值 -3 是通过以下方式获得的:

The value of the convolution, 3-3, was obtained as,


(23+31+54+11)+(22+3(1)+5(3)+11)+(21+3(1)+52+1(1))+(2). \begin{split}(-2\cdot 3 + 3\cdot 1 + 5\cdot 4 + -1\cdot 1) \\ + (-2\cdot 2 + 3\cdot(-1) + 5\cdot(-3) + -1\cdot 1) \\ + (-2\cdot 1 + 3\cdot(-1) + 5\cdot 2 + -1\cdot(-1)) \\ + (-2).\end{split} .
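The matrices of fig. 25 are not reproduced in the text, so the 2×2 filter slice and per-channel patches below are reverse-engineered from the arithmetic above (an assumption for illustration); the volume convolution itself is just per-matrix convolutions summed, plus the bias:

```python
def convolve(P, F):
    # Sum of elementwise products of two same-size matrices.
    return sum(p * f for pr, fr in zip(P, F) for p, f in zip(pr, fr))

def convolve_volume(patches, filters, bias):
    # Convolution of a volume patch: sum the per-matrix convolutions, add the bias.
    return sum(convolve(P, F) for P, F in zip(patches, filters)) + bias

# Values read off the arithmetic above (assumed, since the figure is not in the text):
filters = [[[-2, 3], [5, -1]]] * 3          # the same 2x2 filter slice per channel
patches = [[[3, 1], [4, 1]],
           [[2, -1], [-3, 1]],
           [[1, -1], [2, -1]]]
bias = -2
```

With these values, `convolve_volume(patches, filters, bias)` reproduces the result -3.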

在计算机视觉中,CNN 通常以体积作为输入,因为图像通常由三个通道表示:R、G 和 B,每个通道都是单色图片。

In computer vision, CNNs often get volumes as input, since an image is usually represented by three channels: R, G, and B, each channel being a monochrome picture.

卷积的两个重要属性是步幅填充。步幅是移动窗口的步长。在图 24 中,步幅为 1,即过滤器每次向右和向下滑动一个单元格。

Two important properties of convolution are stride and padding. Stride is the step size of the moving window. In fig. 24, the stride is 11, that is the filter slides to the right and to the bottom by one cell at a time.

在下图中,您可以看到步幅为 2 的卷积的部分示例。可以看到,步幅越大,输出矩阵越小。

In the figure below, you can see a partial example of convolution with stride 22. You can see that the output matrix is smaller when stride is bigger.

图 26:步长为 2 的卷积。
图 26:步长为 2 的卷积。

填充(Padding)可以得到更大的输出矩阵;它是在与过滤器进行卷积之前,围绕图像(或体积)的附加单元格方块的宽度。通过填充添加的单元格通常包含零。在图 24 中,填充为 0,因此没有额外的单元格添加到图像中。

Padding allows getting a larger output matrix; it’s the width of the square of additional cells with which you surround the image (or volume) before you convolve it with the filter. The cells added by padding usually contain zeroes. In fig. 24, the padding is 00, so no additional cells are added to the image.

另一方面,在图 27 中,步幅为 2,填充为 1,因此宽度为 1 的一圈附加单元格被添加到图像中。您可以看到,填充越大,输出矩阵就越大5

In fig. 27, on the other hand, the stride is 22 and padding is 11, so a square of width 11 of additional cells are added to the image. You can see that the output matrix is bigger when padding is bigger5.

图 27:步幅为 2 且填充为 1 的卷积。
图 27:步幅为 2 且填充为 1 的卷积。
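The effect of stride and padding on the output size can be sketched directly; the standard per-dimension output size is (n - p + 2·padding)/stride + 1, and the image sizes in the usage note are my own examples (the figures' exact sizes are not reproduced in the text):

```python
def conv2d(image, F, stride=1, padding=0):
    # Pad the square image with zeroes, then slide the p x p filter with the given stride.
    p = len(F)
    n = len(image)
    padded = [[0] * (n + 2 * padding) for _ in range(n + 2 * padding)]
    for i in range(n):
        for j in range(n):
            padded[i + padding][j + padding] = image[i][j]
    m = (n - p + 2 * padding) // stride + 1   # output side length
    out = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            out[i][j] = sum(padded[i * stride + a][j * stride + b] * F[a][b]
                            for a in range(p) for b in range(p))
    return out
```

For a 4×4 image and a 2×2 filter: stride 1 with padding 0 gives a 3×3 output, stride 2 shrinks it to 2×2, and stride 2 with padding 1 grows it back to 3×3, matching the behavior described for figs. 26 and 27.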

填充为 2 的图像示例如下图所示:

An example of an image with padding 22 is shown below:

图 28:带有填充 2 的图像。
图 28:带有填充 2 的图像。

填充对于较大的过滤器很有帮助,因为它允许它们更好地“扫描”图像的边界。

Padding is helpful with larger filters because it allows them to better “scan” the boundaries of the image.

如果不介绍池化(CNN 中经常使用的一种技术),本节就不完整。池化的工作方式与卷积非常相似,也是使用移动窗口方法应用的过滤器。然而,池化层不是将可训练的过滤器应用于输入矩阵或体积,而是应用固定的运算符,通常是 \max 或 \operatorname{average}。与卷积类似,池化也有超参数:过滤器的大小和步幅。下面显示了过滤器大小为 2、步幅为 2 的 \max 池化的示例:

This section would not be complete without presenting pooling, a technique very often used in CNNs. Pooling works in a way very similar to convolution, as a filter applied using a moving window approach. However, instead of applying a trainable filter to an input matrix or a volume, the pooling layer applies a fixed operator, usually either max\max or average\operatorname{average}. Similarly to convolution, pooling has hyperparameters: the size of the filter and the stride. An example of max\max pooling with a filter of size 22 and stride 22 is shown below:

图 29:使用大小为 2 和步幅为 2 的过滤器进行池化。
图 29:使用大小为 2 和步幅为 2 的过滤器进行池化。

通常,池化层位于卷积层之后,它将卷积的输出作为输入。当池化应用于卷时,卷中的每个矩阵都会独立于其他矩阵进行处理。因此,应用于体积的池化层的输出是与输入具有相同深度的体积。

Usually, a pooling layer follows a convolution layer, and it gets the output of convolution as input. When pooling is applied to a volume, each matrix in the volume is processed independently of others. Therefore, the output of the pooling layer applied to a volume is a volume of the same depth as the input.

正如你所看到的,池化只有超参数,没有需要学习的参数。实践中通常使用大小为 2 或 3 的过滤器和步幅 2。最大池化比平均池化更受欢迎,并且通常会给出更好的结果。

As you can see, pooling only has hyperparameters and doesn’t have parameters to learn. Typically, a filter of size 22 or 33 and a stride of 22 are used in practice. Max pooling is more popular than average pooling and often gives better results.

通常,池化有助于提高模型的准确性。它还通过减少神经网络的参数数量来提高训练速度。(如图 29 所示,当过滤器大小为 2、步幅为 2 时,参数数量减少到 25%,即 4 个参数而不是 16 个。)

Typically pooling contributes to the increased accuracy of the model. It also improves the speed of training by reducing the number of parameters of the neural network. (As you can see in fig. 29, with filter size 22 and stride 22 the number of parameters is reduced to 25%, that is to 44 parameters instead of 1616.)
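A minimal sketch of max pooling with filter size 2 and stride 2, the common setting named above:

```python
def max_pool(matrix, size=2, stride=2):
    # Slide a size x size window with the given stride and keep the window maximum.
    n = len(matrix)
    m = (n - size) // stride + 1
    return [[max(matrix[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size))
             for j in range(m)]
            for i in range(m)]
```

A 4×4 input shrinks to 2×2, i.e. 4 values instead of 16 — the 25% reduction quoted for fig. 29.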

6.2.2循环神经网络

6.2.2 Recurrent Neural Network

循环神经网络(RNN) 用于标记、分类或生成序列。序列是一个矩阵,其中的每一行都是一个特征向量,并且行的顺序很重要。标记序列就是预测序列中每个特征向量的类别。对序列进行分类就是预测整个序列的类别。生成序列就是输出与输入序列某种程度相关的另一个序列(可能具有不同的长度)。

Recurrent neural networks (RNNs) are used to label, classify, or generate sequences. A sequence is a matrix, each row of which is a feature vector and the order of rows matters. To label a sequence is to predict a class for each feature vector in a sequence. To classify a sequence is to predict a class for the entire sequence. To generate a sequence is to output another sequence (of a possibly different length) somehow relevant to the input sequence.

RNN 经常用于文本处理,因为句子和文本自然是单词/标点符号序列或字符序列。出于同样的原因,循环神经网络也用于语音处理。

RNNs are often used in text processing because sentences and texts are naturally sequences of either words/punctuation marks or sequences of characters. For the same reason, recurrent neural networks are also used in speech processing.

循环神经网络不是前馈的:它包含循环。其思想是,循环层 l 的每个单元 u 具有实值状态 h_{l,u}。状态可以看作单元的记忆。在 RNN 中,每层 l 中的每个单元 u 接收两个输入:来自前一层 l-1 的状态向量,以及同一层 l 在前一个时间步的状态向量。

A recurrent neural network is not feed-forward: it contains loops. The idea is that each unit uu of recurrent layer ll has a real-valued state hl,uh_{l,u}. The state can be seen as the memory of the unit. In RNN, each unit uu in each layer ll receives two inputs: a vector of states from the previous layer l1l-1 and the vector of states from this same layer ll from the previous time step.

为了说明这个想法,让我们考虑 RNN 的第一和第二循环层。第一(最左边)层接收特征向量作为输入。第二层接收第一层的输出作为输入。

To illustrate the idea, let’s consider the first and the second recurrent layers of an RNN. The first (leftmost) layer receives a feature vector as input. The second layer receives the output of the first layer as input.

这种情况示意性地描绘在下面的图 30 中。

This situation is schematically depicted in fig. 30 below.

图 30:RNN 的前两层。输入特征向量是二维的;每层有两个单元。
图 30:RNN 的前两层。输入特征向量是二维的;每层有两个单元。

正如我上面所说,每个训练示例都是一个矩阵,其中每一行都是一个特征向量。为简单起见,我们将该矩阵表示为向量序列 \mathbf{X}=[\mathbf{x}^1,\mathbf{x}^2,\ldots,\mathbf{x}^{t-1},\mathbf{x}^t,\mathbf{x}^{t+1},\ldots,\mathbf{x}^{length_{\mathbf{X}}}],其中 length_{\mathbf{X}} 是输入序列的长度。如果我们的输入示例 \mathbf{X} 是一个文本句子,那么对于每个 t = 1,\ldots, length_{\mathbf{X}},特征向量 \mathbf{x}^{t} 代表句子中位置 t 处的单词。

As I said above, each training example is a matrix in which each row is a feature vector. For simplicity, let’s illustrate this matrix as a sequence of vectors 𝐗=[𝐱1,𝐱2,,𝐱t1,𝐱t,𝐱t+1,,𝐱length𝐗]\mathbf{X}=[\mathbf{x}^1,\mathbf{x}^2,\ldots,\mathbf{x}^{t-1},\mathbf{x}^t,\mathbf{x}^{t+1},\ldots,\mathbf{x}^{length_{\mathbf{X}}}], where length𝐗length_{\mathbf{X}} is the length of the input sequence. If our input example 𝐗\mathbf{X} is a text sentence, then feature vector 𝐱t\mathbf{x}^{t} for each t=1,,length𝐗t = 1,\ldots, length_{\mathbf{X}} represents a word in the sentence at position tt.

如图 30 所示,在 RNN 中,来自输入示例的特征向量由神经网络按照时间步的顺序依次“读取”。索引 t 表示时间步。为了在每个时间步 t 更新每层 l 的每个单元 u 中的状态 h_{l,u}^t,我们首先计算输入特征向量与同一层在前一个时间步 t-1 的状态向量 \mathbf{h}_{l,u}^{t-1} 的线性组合。两个向量的线性组合使用两个参数向量 \mathbf{w}_{l,u}、\mathbf{u}_{l,u} 和一个参数 b_{l,u} 来计算。然后通过对线性组合的结果应用激活函数 g_1 来获得 h_{l,u}^t 的值。函数 g_1 的典型选择是 tanh。输出 \mathbf{y}^t_{l} 通常是一次为整个层 l 计算的向量。为了获得 \mathbf{y}^t_{l},我们使用激活函数 \bm{g}_{2},它以向量作为输入并返回相同维数的不同向量。函数 \bm{g}_{2} 应用于使用参数矩阵 \mathbf{V}_{l} 和参数向量 \mathbf{c}_{l,u} 计算的状态向量值 \mathbf{h}^t_{l,u} 的线性组合。在分类中,\bm{g}_{2} 的典型选择是 softmax 函数

As depicted in fig. 30, in an RNN, the feature vectors from an input example are “read” by the neural network sequentially in the order of the timesteps. The index tt denotes a timestep. To update the state hl,uth_{l,u}^t at each timestep tt in each unit uu of each layer ll we first calculate a linear combination of the input feature vector with the state vector 𝐡l,ut1\mathbf{h}_{l,u}^{t-1} of this same layer from the previous timestep, t1t-1. The linear combination of two vectors is calculated using two parameter vectors 𝐰l,u\mathbf{w}_{l,u}, 𝐮l,u\mathbf{u}_{l,u} and a parameter bl,ub_{l,u}. The value of hl,uth_{l,u}^t is then obtained by applying activation function g1g_1 to the result of the linear combination. A typical choice for function g1g_1 is tanhtanh. The output 𝐲lt\mathbf{y}^t_{l} is typically a vector calculated for the whole layer ll at once. To obtain 𝐲lt\mathbf{y}^t_{l}, we use activation function 𝐠2\bm{g}_{2} that takes a vector as input and returns a different vector of the same dimensionality. The function 𝐠2\bm{g}_{2} is applied to a linear combination of the state vector values 𝐡l,ut\mathbf{h}^t_{l,u} calculated using a parameter matrix 𝐕l\mathbf{V}_{l} and a parameter vector 𝐜l,u\mathbf{c}_{l,u}. In classification, a typical choice for 𝐠2\bm{g}_{2} is the softmax function:


𝛔(𝐳)=def[σ(1),,σ(D)], \bm{\sigma}(\mathbf{z}) \stackrel{\text{def}}{=} [\sigma^{(1)},\ldots,\sigma^{(D)}],

where \sigma^{(j)} \stackrel{\text{def}}{=} \frac{\exp\left(z^{(j)}\right)}{\sum_{k=1}^{D}\exp\left(z^{(k)}\right)}.

The softmax function is a generalization of the sigmoid function to multidimensional outputs. It has the property that \sum_{j=1}^D \sigma^{(j)} = 1 and \sigma^{(j)} > 0 for all j.
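The softmax computation is easy to sketch in code. Below is a minimal NumPy version (not from the book); subtracting the maximum before exponentiation is a standard numerical-stability trick that does not change the result, because softmax is invariant to adding a constant to all inputs:

```python
import numpy as np

def softmax(z):
    # Subtract max(z) to avoid overflow in exp; the output is unchanged
    # because softmax is invariant to adding a constant to every input.
    e = np.exp(z - np.max(z))
    return e / e.sum()

s = softmax(np.array([1.0, 2.0, 3.0]))
# s sums to 1 and every component is strictly positive
```

Because the outputs sum to one and are strictly positive, they can be read directly as class probabilities.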

The dimensionality of \mathbf{V}_l is chosen by the data analyst such that the multiplication of the matrix \mathbf{V}_l by the vector \mathbf{h}_l^t results in a vector of the same dimensionality as the vector \mathbf{c}_l. This choice depends on the dimensionality of the output label \mathbf{y} in your training data. (Until now, we have only seen one-dimensional labels, but we will see in later chapters that labels can be multidimensional as well.)

The values of \mathbf{w}_{l,u}, \mathbf{u}_{l,u}, b_{l,u}, \mathbf{V}_l, and \mathbf{c}_{l,u} are computed from the training data using gradient descent with backpropagation. To train RNN models, a special version of backpropagation is used, called backpropagation through time.

Both tanh and softmax suffer from the vanishing gradient problem. Even if our RNN has just one or two recurrent layers, because of the sequential nature of the input, backpropagation has to “unfold” the network over time. From the point of view of the gradient calculation, this means in practice that the longer the input sequence, the deeper the unfolded network.

Another problem RNNs have is handling long-term dependencies. As the length of the input sequence grows, the feature vectors from the beginning of the sequence tend to be “forgotten,” because the state of each unit, which serves as the network’s memory, becomes significantly affected by the feature vectors read more recently. Therefore, in text or speech processing, the cause-effect link between distant words in a long sentence can be lost.

The most effective recurrent neural network models used in practice are gated RNNs. These include the long short-term memory (LSTM) networks and networks based on the gated recurrent unit (GRU).

The beauty of using gated units in RNNs is that such networks can store information in their units for future use, much like bits in a computer’s memory. The difference with real memory is that reading, writing, and erasure of the information stored in each unit are controlled by activation functions that take values in the range (0,1). The trained neural network can “read” the input sequence of feature vectors and decide at some early time step t to keep specific information about the feature vectors. That information about the earlier feature vectors can later be used by the model to process the feature vectors from near the end of the input sequence. For example, if the input text starts with the word she, a language-processing RNN model could decide to store information about the gender to correctly interpret the word their seen later in the sentence.

Units make decisions about what information to store, and when to allow reads, writes, and erasures. Those decisions are learned from data and implemented through the concept of gates. There are several architectures of gated units. A simple but effective one is called the minimal gated GRU and is composed of a memory cell and a forget gate.

Let’s look at the math of a GRU unit using the example of the first layer of the RNN (the one that takes the sequence of feature vectors as input). A minimal gated GRU unit u in layer l takes two inputs: the vector of the memory cell values from all units in the same layer from the previous timestep, \mathbf{h}_l^{t-1}, and a feature vector \mathbf{x}^t. It then uses these two vectors as follows (all operations in the sequence below are executed in the unit one after another):

\begin{equation*} \begin{split} \tilde{h}_{l,u}^t &\gets g_1(\mathbf{w}_{l,u}\mathbf{x}^t + \mathbf{u}_{l,u}\mathbf{h}_l^{t-1} + b_{l,u}), \\ \Gamma_{l,u}^t &\gets g_2(\mathbf{m}_{l,u}\mathbf{x}^t + \mathbf{o}_{l,u}\mathbf{h}_l^{t-1} + a_{l,u}), \\ h_{l,u}^t &\gets \Gamma_{l,u}^t\tilde{h}_{l,u}^t + (1-\Gamma_{l,u}^t)h_{l,u}^{t-1}, \\ \mathbf{h}_l^t &\gets [h_{l,1}^t,\ldots,h_{l,size_l}^t], \\ \mathbf{y}_{l}^t &\gets \bm{g}_3(\mathbf{V}_{l}\mathbf{h}_l^t + \mathbf{c}_{l}), \end{split} \end{equation*}

where g_1 is the tanh activation function and g_2 is called the gate function, implemented as the sigmoid function taking values in the range (0,1). If the gate \Gamma_{l,u} is close to 0, the memory cell keeps its value from the previous time step, h_{l,u}^{t-1}. On the other hand, if the gate \Gamma_{l,u} is close to 1, the value of the memory cell is overwritten by a new value \tilde{h}_{l,u}^t (see the third assignment from the top). Just as in standard RNNs, \mathbf{g}_3 is usually softmax.
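The minimal gated unit update can be sketched for a whole layer at one timestep. This is an illustrative NumPy translation of the assignments above, not the book’s code; the parameter names (W for the stacked \mathbf{w}_{l,u}, U for \mathbf{u}_{l,u}, and so on) are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def min_gru_step(x_t, h_prev, W, U, b, M, O, a):
    """One timestep for one layer of minimal gated units.
    W, M: (units, input_dim); U, O: (units, units); b, a: (units,)."""
    h_tilde = np.tanh(W @ x_t + U @ h_prev + b)       # candidate state (g1 = tanh)
    gamma = sigmoid(M @ x_t + O @ h_prev + a)         # gate values in (0, 1) (g2 = sigmoid)
    return gamma * h_tilde + (1.0 - gamma) * h_prev   # gated update of the memory cells

h_prev = np.array([0.3, -0.7])
# With all weights zero and a strongly negative gate bias, the gate is ~0,
# so the cells simply carry their previous values forward.
h_new = min_gru_step(np.zeros(3), h_prev,
                     np.zeros((2, 3)), np.zeros((2, 2)), np.zeros(2),
                     np.zeros((2, 3)), np.zeros((2, 2)), np.full(2, -100.0))
```

When the gate saturates near 0, the cell behaves as the identity on its own state, which is exactly the property the text credits for avoiding vanishing gradients.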

A gated unit takes an input and stores it for some time. This is equivalent to applying the identity function (f(x) = x) to the input. Because the derivative of the identity function is constant, when a network with gated units is trained with backpropagation through time, the gradient does not vanish.

Other important extensions to RNNs include bi-directional RNNs, RNNs with attention, and sequence-to-sequence RNN models. The latter, in particular, are frequently used to build neural machine translation models and other models for text-to-text transformations. A generalization of an RNN is the recursive neural network.


  1. A scalar function outputs a scalar, that is, a simple number and not a vector.

  2. The function has to be differentiable across its whole domain or in the majority of the points of its domain. For example, ReLU is not differentiable at 0.

  3. Each pixel of an image is a feature. If our image is 100 by 100 pixels, then there are 10,000 features.

  4. Consider this as if you looked at a dollar bill in a microscope. To see the whole bill you have to gradually move your bill from left to right and from top to bottom. At each moment in time, you see only a part of the bill of fixed dimensions. This approach is called moving window.

  5. To save space, in fig. 27, only the first two of the nine convolutions are shown.

7 Problems and Solutions

7.1 Kernel Regression

We talked about linear regression, but what if our data doesn’t have the form of a straight line? Polynomial regression could help. Let’s say we have one-dimensional data \{(x_i, y_i)\}_{i=1}^N. We could try to fit a quadratic curve y = w_1 x_i + w_2 x_i^2 + b to our data. By defining the mean squared error (MSE) cost function, we could apply gradient descent and find the values of the parameters w_1, w_2, and b that minimize this cost function. In one- or two-dimensional space, we can easily see whether the function fits the data. However, if our input is a D-dimensional feature vector, with D > 3, finding the right polynomial would be hard.

Kernel regression is a non-parametric method. That means that there are no parameters to learn. The model is based on the data itself (like in kNN). In its simplest form, in kernel regression we look for a model like this:

f(x) = \frac{1}{N}\sum_{i=1}^N w_i y_i,\qquad(21)

where

w_i = \frac{N k(\frac{x_i - x}{b})}{\sum_{l=1}^N k(\frac{x_l - x}{b})}.

The function k(\cdot) is called a kernel. The kernel plays the role of a similarity function: the values of the coefficients w_i are higher when x is similar to x_i and lower when they are dissimilar. Kernels can have different forms; the most frequently used one is the Gaussian kernel:

k(z) = \frac{1}{\sqrt{2\pi}}\exp{\left(\frac{-z^2}{2}\right)}.

Figure 31: Example of a kernel regression line with a Gaussian kernel, b = 3.0 (good fit).
Figure 32: Example of a kernel regression line with a Gaussian kernel, b = 0.5 (slight overfitting).
Figure 33: Example of a kernel regression line with a Gaussian kernel, b = 0.1 (strong overfitting).

The value b is a hyperparameter that we tune using the validation set (by running the model built with a specific value of b on the validation set examples and calculating the MSE). You can see an illustration of the influence b has on the shape of the regression line in figs. 31-33.
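Eq. 21 together with the weight definition reduces to a weighted average whose weights are kernel values: the N in the numerator of w_i cancels the 1/N in front of the sum. A short NumPy sketch for one-dimensional inputs (illustrative, not the book’s code):

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-z ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def kernel_regression(x, x_train, y_train, b):
    # f(x) = (1/N) sum_i w_i y_i with w_i = N k((x_i - x)/b) / sum_l k((x_l - x)/b),
    # which simplifies to sum_i k_i y_i / sum_l k_l.
    k = gaussian_kernel((x_train - x) / b)
    return np.sum(k * y_train) / np.sum(k)

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([0.0, 1.0, 2.0])
pred = kernel_regression(1.0, x_train, y_train, b=1.0)
```

Decreasing b concentrates the weights on the nearest training examples, which is why small values of b overfit, as in figs. 32 and 33.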

If your inputs are multi-dimensional feature vectors, the terms x_i - x and x_l - x in eq. 21 have to be replaced by the Euclidean distances \|\mathbf{x}_i - \mathbf{x}\| and \|\mathbf{x}_l - \mathbf{x}\|, respectively.

7.2 Multiclass Classification

Although many classification problems can be defined using two classes, some are defined with more than two classes, which requires adaptations of our machine learning algorithms.

In multiclass classification, the label can be one of C classes: y \in \{1,\ldots,C\}. Many machine learning algorithms are binary; SVM is an example. Some algorithms can naturally be extended to handle multiclass problems. ID3 and other decision tree learning algorithms can be simply changed as follows:

f_{ID3}^{\mathit{S}} \stackrel{\text{def}}{=} \Pr(y_i=c|\mathbf{x}) = \frac{1}{|\mathcal{S}|} \sum_{\{y\, |\, (\mathbf{x}, y) \in \mathcal{S}, y = c\}} 1,

for all c \in \{1,\ldots,C\}, where \mathcal{S} is the leaf node in which the prediction is made.

Logistic regression can be naturally extended to multiclass learning problems by replacing the sigmoid function with the softmax function which we already saw in Chapter 6.

The kNN algorithm is also straightforward to extend to the multiclass case: when we find the k closest examples for the input \mathbf{x} and examine them, we return the class that we saw most often among the k examples.
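The multiclass kNN rule fits in a few lines. A minimal sketch with illustrative names (not the book’s code):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k):
    # Distances from x to every training example, then a majority vote
    # among the labels of the k nearest ones.
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [5.0, 4.0]])
y_train = [1, 1, 2, 2, 2]
label = knn_predict(np.array([5.0, 5.0]), X_train, y_train, k=3)  # -> 2
```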

SVM cannot be naturally extended to multiclass problems. Other algorithms can be implemented more efficiently in the binary case. What should you do if you have a multiclass problem but a binary classification learning algorithm? One common strategy is called one versus rest. The idea is to transform a multiclass problem into C binary classification problems and build C binary classifiers. For example, if we have three classes, y \in \{1,2,3\}, we create copies of the original dataset and modify them. In the first copy, we replace all labels not equal to 1 by 0. In the second copy, we replace all labels not equal to 2 by 0. In the third copy, we replace all labels not equal to 3 by 0. Now we have three binary classification problems where we have to learn to distinguish between labels 1 and 0, 2 and 0, and 3 and 0.
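The relabeling step of one versus rest is mechanical. A small illustrative helper (hypothetical, not from the book; it maps the kept class to 1 rather than keeping its original value, which is an equivalent binary encoding):

```python
def one_vs_rest_labels(y, c):
    # Copy of the labels where class c becomes 1 and every other class becomes 0.
    return [1 if label == c else 0 for label in y]

y = [1, 2, 3, 1, 2]
binary_copies = {c: one_vs_rest_labels(y, c) for c in (1, 2, 3)}
# binary_copies[2] == [0, 1, 0, 0, 1]
```

Each binary copy is then used to train one of the C binary classifiers.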

Once we have the three models, to classify a new input feature vector \mathbf{x}, we apply the three models to the input and get three predictions. We then pick the prediction of the non-zero class that is the most certain. Remember that in logistic regression, the model returns not a label but a score (between 0 and 1) that can be interpreted as the probability that the label is positive. We can also interpret this score as the certainty of the prediction. In SVM, the analog of certainty is the distance d from the input \mathbf{x} to the decision boundary, given by d \stackrel{\text{def}}{=} \frac{\mathbf{w}^*\mathbf{x} + b^*}{\|\mathbf{w}\|}.

The larger the distance, the more certain the prediction. Most learning algorithms can either be naturally converted to the multiclass case or return a score we can use in the one versus rest strategy.

7.3 One-Class Classification

Sometimes we only have examples of one class and we want to train a model that would distinguish examples of this class from everything else.

One-class classification, also known as unary classification or class modeling, tries to identify objects of a specific class among all objects, by learning from a training set containing only the objects of that class. That is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all classes. A typical one-class classification problem is the classification of the traffic in a secure computer network as normal. In this scenario, there are few, if any, examples of the traffic under an attack or during an intrusion. However, the examples of normal traffic are often in abundance. One-class classification learning algorithms are used for outlier detection, anomaly detection, and novelty detection.

There are several one-class learning algorithms. The most widely used in practice are one-class Gaussian, one-class k-means, one-class kNN, and one-class SVM.

The idea behind the one-class Gaussian is that we model our data as if it came from a Gaussian distribution, more precisely, a multivariate normal distribution (MND). The probability density function (pdf) of the MND is given by the following equation:

f_{\boldsymbol{\mu},\boldsymbol{\Sigma}}(\mathbf{x}) \stackrel{\text{def}}{=} {\frac {e^{\Bigg(-{\frac {1}{2}}({\mathbf {x} }-{\boldsymbol {\mu }})^\top{\boldsymbol {\Sigma }}^{-1}({\mathbf {x} }-{\boldsymbol {\mu }})\Bigg)}}{\sqrt {(2\pi )^{D}|{\boldsymbol {\Sigma }}|}}},

where f_{\boldsymbol{\mu},\boldsymbol{\Sigma}}(\mathbf{x}) returns the probability density corresponding to the input feature vector \mathbf{x}. Probability density can be interpreted as the likelihood that example \mathbf{x} was drawn from the probability distribution we model as an MND. The values \boldsymbol{\mu} (a vector) and \boldsymbol{\Sigma} (a matrix) are the parameters we have to learn. The maximum likelihood criterion (similar to how we solved the logistic regression learning problem) is optimized to find the optimal values for these two parameters. |\boldsymbol{\Sigma}| \stackrel{\text{def}}{=} \operatorname{det}\boldsymbol{\Sigma} is the determinant of the matrix \boldsymbol{\Sigma}; the notation \boldsymbol{\Sigma}^{-1} means the inverse of the matrix \boldsymbol{\Sigma}.

If the terms determinant and inverse are new to you, don’t worry. These are standard operations on vectors and matrices from the branch of mathematics called matrix theory. If you feel the need to know what they are, Wikipedia explains these concepts well.

Figure 34: One-class classification solved using the one-class Gaussian method: two-dimensional feature vectors.
Figure 35: One-class classification solved using the one-class Gaussian method: the MND curve that maximizes the likelihood of the examples in fig. 34.

In practice, the numbers in the vector \boldsymbol{\mu} determine the place where the curve of our Gaussian distribution is centered, while the numbers in \boldsymbol{\Sigma} determine the shape of the curve. For a training set consisting of two-dimensional feature vectors, an example of the one-class Gaussian model is given in figs. 34 and 35.

Once we have our model, parametrized by \boldsymbol{\mu} and \boldsymbol{\Sigma} learned from the data, we predict the likelihood of every input \mathbf{x} by using f_{\boldsymbol{\mu},\boldsymbol{\Sigma}}(\mathbf{x}). We predict that the example belongs to our class only if the likelihood is above a certain threshold; otherwise, it is classified as an outlier. The value of the threshold is found experimentally or using an “educated guess.”
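The whole one-class Gaussian pipeline (fit \boldsymbol{\mu} and \boldsymbol{\Sigma} by maximum likelihood, evaluate the MND density, compare it to a threshold) can be sketched with NumPy. This is illustrative only; a production implementation would work with log-densities and guard against a singular covariance matrix:

```python
import numpy as np

def fit_mnd(X):
    # Maximum-likelihood estimates: sample mean and (biased) sample covariance.
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False, bias=True)
    return mu, sigma

def mnd_pdf(x, mu, sigma):
    # Direct transcription of the MND density formula from the text.
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(sigma) @ diff
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** D * np.linalg.det(sigma))

def is_outlier(x, mu, sigma, threshold):
    return mnd_pdf(x, mu, sigma) < threshold

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0],
              [0.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
mu, sigma = fit_mnd(X)
```

Points far from the training cloud receive a much lower density than points near its center, so a single threshold separates inliers from outliers.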

When the data has a more complex shape, a more advanced algorithm can use a combination of several Gaussians (called a mixture of Gaussians). In this case, there are more parameters to learn from the data: one \boldsymbol{\mu} and one \boldsymbol{\Sigma} for each Gaussian, as well as the parameters that allow combining multiple Gaussians to form one pdf. In Chapter 9, we consider a mixture of Gaussians with an application to clustering.

One-class k-means and one-class kNN are based on a principle similar to that of the one-class Gaussian: build some model of the data and then define a threshold to decide whether a new feature vector looks similar to the other examples according to the model. In the former, all training examples are clustered using the k-means clustering algorithm and, when a new example \mathbf{x} is observed, the distance d(\mathbf{x}) is calculated as the minimum distance between \mathbf{x} and the center of each cluster. If d(\mathbf{x}) is less than a particular threshold, then \mathbf{x} belongs to the class.

One-class SVM, depending on formulation, tries either 1) to separate all training examples from the origin (in the feature space) and maximize the distance from the hyperplane to the origin, or 2) to obtain a spherical boundary around the data by minimizing the volume of this hypersphere. I leave the description of the one-class kNN algorithm, as well as the details of the one-class k-means and one-class SVM for the complementary reading.

7.4 Multi-Label Classification

In some situations, more than one label is appropriate to describe an example from the dataset. In this case, we talk about multi-label classification.

For instance, if we want to describe an image, we could assign several labels to it: “conifer,” “mountain,” “road,” all three at the same time (fig. 36).

Figure 36: A picture labeled as “conifer,” “mountain,” and “road.” Photo: Kate Lagadia.

If the number of possible values for labels is high, but they are all of the same nature, like tags, we can transform each labeled example into several labeled examples, one per label. These new examples all have the same feature vector and only one label. That becomes a multiclass classification problem. We can solve it using the one versus rest strategy. The only difference from the usual multiclass problem is that now we have a new hyperparameter: the threshold. If the prediction score for some label is above the threshold, that label is predicted for the input feature vector. In this scenario, multiple labels can be predicted for one feature vector. The value of the threshold is chosen using the validation set.

Analogously, algorithms that can naturally be made multiclass (decision trees, logistic regression, and neural networks, among others) can be applied to multi-label classification problems. Because they return a score for each class, we can define a threshold and then assign to one feature vector all the labels whose scores are above that threshold.

Neural network algorithms can naturally train multi-label classification models by using the binary cross-entropy cost function. The output layer of the neural network, in this case, has one unit per label. Each unit of the output layer has the sigmoid activation function. Accordingly, each label l is binary (y_{i,l} \in \{0,1\}), where l = 1,\ldots,L and i = 1,\ldots,N. The binary cross-entropy of predicting the probability \hat{y}_{i,l} that example \mathbf{x}_i has label l is defined as,

-(y_{i,l} \ln(\hat{y}_{i,l})+(1-y_{i,l})\ln(1-\hat{y}_{i,l})).

The minimization criterion is simply the average of all binary cross-entropy terms across all training examples and all labels of those examples.
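The criterion just described can be computed over the whole label matrix at once. A minimal NumPy sketch (not from the book; the small epsilon clips the predictions away from 0 and 1 so the logarithms stay finite):

```python
import numpy as np

def multilabel_bce(Y, Y_hat, eps=1e-12):
    # Y: (N, L) true binary labels; Y_hat: (N, L) predicted probabilities.
    Y_hat = np.clip(Y_hat, eps, 1.0 - eps)
    terms = -(Y * np.log(Y_hat) + (1.0 - Y) * np.log(1.0 - Y_hat))
    return terms.mean()  # average over all examples and all labels
```

Perfect predictions give a cost near zero, while an uninformative prediction of 0.5 for every label gives ln 2 per term.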

In cases where the number of possible values each label can take is small, one can convert a multi-label problem into a multiclass problem using a different approach. Imagine the following problem. We want to label images, and labels can be of two types. The first type of label can have two possible values: \{photo, painting\}; the label of the second type can have three possible values: \{portrait, paysage, other\}. We can create a new fake class for each combination of the two original classes, like this:

Fake Class	Real Class 1	Real Class 2
1	photo	portrait
2	photo	paysage
3	photo	other
4	painting	portrait
5	painting	paysage
6	painting	other

Now we have the same labeled examples, but we replace the real multi-labels with one fake label with values from 1 to 6. This approach works well in practice when there are not too many possible combinations of classes. Otherwise, you need to use much more training data to compensate for the increased set of classes.
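Enumerating the fake classes is a one-liner with itertools.product. A small illustrative helper (hypothetical names, not from the book):

```python
from itertools import product

def make_fake_classes(label_sets):
    # One fake class number (starting at 1) per combination of original label values.
    combos = list(product(*label_sets))
    to_fake = {combo: i + 1 for i, combo in enumerate(combos)}
    from_fake = {i + 1: combo for i, combo in enumerate(combos)}
    return to_fake, from_fake

to_fake, from_fake = make_fake_classes([("photo", "painting"),
                                        ("portrait", "paysage", "other")])
# to_fake[("photo", "portrait")] == 1; from_fake[6] == ("painting", "other")
```

After training a multiclass model on the fake labels, from_fake maps each predicted fake class back to the original pair of labels.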

The primary advantage of this latter approach is that you keep your labels correlated, contrary to the previously seen methods that predict each label independently of the others. Correlation between labels can be essential in many problems. For example, imagine you want to predict whether an email is spam or not_spam at the same time as predicting whether it’s an ordinary or priority email. You would like to avoid predictions like [spam, priority].

7.5 Ensemble Learning

The fundamental algorithms that we considered in Chapter 3 have their limitations. Because of their simplicity, sometimes they cannot produce a model accurate enough for your problem. You could try using deep neural networks. However, in practice, deep neural networks require a significant amount of labeled data which you might not have. Another approach to boost the performance of simple learning algorithms is ensemble learning.

Ensemble learning is a learning paradigm that, instead of trying to learn one super-accurate model, focuses on training a large number of low-accuracy models and then combining the predictions given by those weak models to obtain a high-accuracy meta-model.

Low-accuracy models are usually learned by weak learners, that is, learning algorithms that cannot learn complex models and thus are typically fast at training and prediction time. The most frequently used weak learner is a decision tree learning algorithm in which we often stop splitting the training set after just a few iterations. The obtained trees are shallow and not particularly accurate, but the idea behind ensemble learning is that if the trees are not identical and each tree is at least slightly better than random guessing, then we can obtain high accuracy by combining a large number of such trees.

To obtain the prediction for input 𝐱\mathbf{x}, the predictions of each weak model are combined using some sort of weighted voting. The specific form of vote weighting depends on the algorithm, but, independently of the algorithm, the idea is the same: if the council of weak models predicts that the message is spam, then we assign the label spam to 𝐱\mathbf{x}.

Two principal ensemble learning methods are boosting and bagging.

7.5.1 Boosting and Bagging

Boosting consists of using the original training data and iteratively creating multiple models by using a weak learner. Each new model differs from the previous ones in the sense that the weak learner, in building each new model, tries to “fix” the errors the previous models made. The final ensemble model is a certain combination of those multiple weak models built iteratively.

Bagging consists of creating many “copies” of the training data (each copy slightly different from the others), applying the weak learner to each copy to obtain multiple weak models, and then combining them. A widely used and effective machine learning algorithm based on the idea of bagging is random forest.

7.5.2 Random Forest

The “vanilla” bagging algorithm works as follows. Given a training set, we create BB random samples 𝒮b\mathcal{S}_b (for each b=1,,Bb=1,\ldots, B) of the training set and build a decision tree model fbf_b using each sample 𝒮b\mathcal{S}_b as the training set. To sample 𝒮b\mathcal{S}_b for some bb, we do the sampling with replacement. This means that we start with an empty set, then pick an example at random from the training set and put its exact copy into 𝒮b\mathcal{S}_b, keeping the original example in the training set. We keep picking examples at random until |𝒮b|=N|\mathcal{S}_b| = N.

After training, we have BB decision trees. The prediction for a new example 𝐱\mathbf{x} is obtained as the average of BB predictions:

yf̂(𝐱)=def1Bb=1Bfb(𝐱), y \leftarrow \hat{f}(\mathbf{x}) \stackrel{\text{def}}{=} \frac{1}{B}\sum_{b=1}^B f_{b}(\mathbf{x}),

in the case of regression, or by taking the majority vote in the case of classification.
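As a minimal sketch, the bagging procedure above can be written in a few lines of Python. The “weak learner” here is deliberately trivial (a depth-0 tree that predicts the mean of its bootstrap sample) rather than a real decision tree learner; all function names and the toy data are illustrative assumptions, not part of the book:

```python
import numpy as np

def bagging_fit(X, y, B, weak_fit, rng):
    """Train B weak models, each on a bootstrap sample of size N
    drawn with replacement from the training set."""
    N = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)   # sampling with replacement, |S_b| = N
        models.append(weak_fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Regression case: average the B weak predictions."""
    return sum(f(x) for f in models) / len(models)

def mean_learner(X, y):
    """Deliberately weak: a depth-0 'tree' that predicts the sample mean."""
    m = float(y.mean())
    return lambda x: m

rng = np.random.default_rng(0)
X = np.arange(10.0).reshape(-1, 1)
y = np.arange(10.0)
models = bagging_fit(X, y, B=50, weak_fit=mean_learner, rng=rng)
pred = bagging_predict(models, np.array([[3.0]]))
```

Because each model sees a different bootstrap sample, the individual predictions differ slightly, and the average hovers around the overall mean of the labels.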

Random forest is different from the vanilla bagging in just one way. It uses a modified tree learning algorithm that inspects, at each split in the learning process, a random subset of the features. The reason for doing this is to avoid the correlation of the trees: if one or a few features are very strong predictors for the target, these features will be selected to split examples in many trees. This would result in many correlated trees in our “forest.” Correlated predictors cannot help in improving the accuracy of prediction. The main reason behind a better performance of model ensembling is that models that are good will likely agree on the same prediction, while bad models will likely disagree on different ones. Correlation will make bad models more likely to agree, which will hamper the majority vote or the average.

The most important hyperparameters to tune are the number of trees, BB, and the size of the random subset of the features to consider at each split.

Random forest is one of the most widely used ensemble learning algorithms. Why is it so effective? The reason is that by using multiple samples of the original dataset, we reduce the variance of the final model. Remember that the low variance means low overfitting. Overfitting happens when our model tries to explain small variations in the dataset because our dataset is just a small sample of the population of all possible examples of the phenomenon we try to model. If we were unlucky with how our training set was sampled, then it could contain some undesirable (but unavoidable) artifacts: noise, outliers and over- or underrepresented examples. By creating multiple random samples with replacement of our training set, we reduce the effect of these artifacts.

7.5.3 Gradient Boosting

Another effective ensemble learning algorithm, based on the idea of boosting, is gradient boosting. Let’s first look at gradient boosting for regression. To build a strong regressor, we start with a constant model f=f0f = f_0 (just like we did in ID3):

f=f0(𝐱)=def1Ni=1Nyi. f = f_0(\mathbf{x}) \stackrel{\text{def}}{=} \frac{1}{N}\sum_{i=1}^N y_i.

Then we modify labels of each example i=1,,Ni = 1,\ldots,N in our training set as follows:

ŷiyif(𝐱i),(22) \hat{y}_i \leftarrow {y}_i - f(\mathbf{x}_i),\qquad(22)

where ŷi\hat{y}_{i}, called the residual, is the new label for example 𝐱i\mathbf{x}_i.

Now we use the modified training set, with residuals instead of original labels, to build a new decision tree model, f1f_1. The boosting model is now defined as f=deff0+αf1f \stackrel{\text{def}}{=} f_0 + \alpha f_1, where α\alpha is the learning rate (a hyperparameter).

Then we recompute the residuals using eq. 22 and replace the labels in the training data once again, train the new decision tree model f2f_2, redefine the boosting model as f=deff0+αf1+αf2f \stackrel{\text{def}}{=} f_0 + \alpha f_1 + \alpha f_2 and the process continues until the predefined maximum MM of trees are combined.

Intuitively, what’s happening here? By computing the residuals, we find how well (or poorly) the target of each training example is predicted by the current model ff. We then train another tree to fix the errors of the current model (this is why we use residuals instead of real labels) and add this new tree to the existing model with some weight α\alpha. Therefore, each additional tree added to the model partially fixes the errors made by the previous trees until the maximum number MM (another hyperparameter) of trees are combined.
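The residual-fitting loop above can be sketched as follows. The weak learner here is a hand-rolled one-split regression stump standing in for a shallow decision tree; fit_stump, gradient_boost, and the toy step-function data are assumptions made for the sake of the example:

```python
import numpy as np

def fit_stump(X, r):
    """One-split regression stump on feature 0, minimizing the squared
    error on the residuals r."""
    x, best = X[:, 0], None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda X_: np.where(X_[:, 0] <= t, lv, rv)

def gradient_boost(X, y, M, alpha):
    """f_0 is the mean label; each new stump fits the current residuals."""
    f0 = y.mean()
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        r = y - pred                     # residuals (eq. 22)
        tree = fit_stump(X, r)
        trees.append(tree)
        pred = pred + alpha * tree(X)    # f <- f + alpha * f_m
    return f0, trees

def predict(f0, trees, alpha, X):
    return f0 + alpha * sum(t(X) for t in trees)

X = np.arange(8.0).reshape(-1, 1)
y = np.array([1.0, 1, 1, 1, 5, 5, 5, 5])   # a step function
f0, trees = gradient_boost(X, y, M=20, alpha=0.3)
yhat = predict(f0, trees, alpha=0.3, X=X)
```

After each iteration the residuals shrink by the factor (1 - alpha), so twenty stumps are enough for the ensemble to fit this step function almost exactly.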

Now you should reasonably ask why the algorithm is called gradient boosting. In gradient boosting, we don’t compute any gradients, contrary to what we did in Chapter 4 for linear regression. To see the similarity between gradient boosting and gradient descent, remember why we calculated the gradient in linear regression: we did it to get an idea of where we should move the values of our parameters so that the MSE cost function reaches its minimum. The gradient showed the direction, but we didn’t know how far we should go in this direction, so we used a small step α\alpha at each iteration and then reevaluated our direction. The same happens in gradient boosting. However, instead of getting the gradient directly, we use its proxy in the form of residuals: they show us how the model has to be adjusted so that the error (the residual) is reduced.

The three principal hyperparameters to tune in gradient boosting are the number of trees, the learning rate, and the depth of trees — all three affect model accuracy. The depth of trees also affects the speed of training and prediction: the shorter, the faster.

It can be shown that training on residuals optimizes the overall model ff for the mean squared error criterion. You can see the difference with bagging here: boosting reduces the bias (or underfitting) instead of the variance. As such, boosting can overfit. However, by tuning the depth and the number of trees, overfitting can be largely avoided.

Gradient boosting for classification is similar, but the steps are slightly different. Let’s consider the binary case. Assume we have MM regression decision trees. Similarly to logistic regression, the prediction of the ensemble of decision trees is modeled using the sigmoid function:

Pr(y=1|𝐱,f)=def11+ef(𝐱), \Pr(y = 1|\mathbf{x}, f) \stackrel{\text{def}}{=} \frac{1}{1+e^{-f(\mathbf{x})}},

where f(𝐱)=defm=1Mfm(𝐱)f(\mathbf{x}) \stackrel{\text{def}}{=} \sum_{m=1}^M f_m(\mathbf{x}) and fmf_m is a regression tree.

Again, like in logistic regression, we apply the maximum likelihood principle by trying to find such an ff that maximizes Lf=i=1Nln[Pr(yi=1|𝐱i,f)]L_f = \sum_{i=1}^N \ln\left[\Pr(y_i = 1|\mathbf{x}_i, f)\right]. Again, to avoid numerical overflow, we maximize the sum of log-likelihoods rather than the product of likelihoods.

The algorithm starts with the initial constant model f=f0=p1pf = f_0 = \frac{p}{1-p}, where p=1Ni=1Nyip = \frac{1}{N}\sum_{i=1}^N y_i. (It can be shown that such initialization is optimal for the sigmoid function.) Then at each iteration mm, a new tree fmf_m is added to the model. To find the best fmf_m, first the partial derivative gig_i of the current model is calculated for each i=1,,Ni=1,\ldots, N:

gi=dLfdf,g_i = \frac{dL_f}{df},

where ff is the ensemble classifier model built at the previous iteration m1m-1. To calculate gig_i we need to find the derivatives of ln[Pr(yi=1|𝐱i,f)]\ln\left[\Pr(y_i = 1|\mathbf{x}_i, f)\right] with respect to ff for all ii. Notice that ln[Pr(yi=1|𝐱i,f)]=defln[11+ef(𝐱i)]\ln\left[\Pr(y_i = 1|\mathbf{x}_i, f)\right] \stackrel{\text{def}}{=} \ln\left[\frac{1}{1+e^{-f(\mathbf{x}_i)}}\right]. The derivative of the right-hand term in the previous equation with respect to ff equals 1ef(𝐱i)+1\frac{1}{e^{f(\mathbf{x}_i)} + 1}.
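As a quick sanity check of the stated derivative (which applies to the term for an example with yi=1y_i = 1), we can compare it against a finite-difference approximation:

```python
import math

def log_sigmoid(f):
    """ln Pr(y=1 | x, f) = ln(1 / (1 + e^(-f)))."""
    return math.log(1.0 / (1.0 + math.exp(-f)))

def analytic_grad(f):
    """The claimed derivative: 1 / (e^f + 1)."""
    return 1.0 / (math.exp(f) + 1.0)

f, eps = 0.7, 1e-6
# Central finite difference of the log-likelihood term with respect to f.
numeric = (log_sigmoid(f + eps) - log_sigmoid(f - eps)) / (2 * eps)
```

The numeric estimate and analytic_grad(f) agree to many digits, confirming the formula.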

We then transform our training set by replacing the original label yiy_i with the corresponding partial derivative gig_i, and build a new tree fmf_m using the transformed training set. Then we find the optimal update step ρm\rho_m as:

ρmarg maxρLf+ρfm.\rho_m \gets \mathop{\mathrm{arg\,max}}_{\rho} L_{f + \rho f_m}.

At the end of iteration mm, we update the ensemble model ff by adding the new tree fmf_m:

ff+αρmfm. f \gets f + \alpha\rho_m f_m.

We iterate until m=Mm = M, then we stop and return the ensemble model ff.

Gradient boosting is one of the most powerful machine learning algorithms: not just because it creates very accurate models, but also because it is capable of handling huge datasets with millions of examples and features. It usually outperforms random forest in accuracy but, because of its sequential nature, can be significantly slower in training.

7.6 Learning to Label Sequences

Sequences are among the most frequently observed types of structured data. We communicate using sequences of words and sentences, we execute tasks in sequences, and our genes, the music we listen to, the videos we watch, and our observations of continuous processes, such as a moving car or the price of a stock, are all sequential.

Sequence labeling is the problem of automatically assigning a label to each element of a sequence. A labeled sequential training example in sequence labeling is a pair of lists (𝐗,𝐘)(\mathbf{X}, \mathbf{Y}), where 𝐗\mathbf{X} is a list of feature vectors, one per time step, and 𝐘\mathbf{Y} is a list of labels of the same length. For example, 𝐗\mathbf{X} could represent words in a sentence such as [“big”, “beautiful”, “car”], and 𝐘\mathbf{Y} would be the list of the corresponding parts of speech, such as [“adjective”, “adjective”, “noun”]. More formally, in an example ii, 𝐗i=[𝐱i1,𝐱i2,,𝐱isizei]\mathbf{X}_i = [\mathbf{x}_i^1,\mathbf{x}_i^2,\ldots,\mathbf{x}_i^{size_i}], where sizeisize_i is the length of the sequence of the example ii, 𝐘i=[yi1,yi2,,yisizei]\mathbf{Y}_i = [y_i^1,y_i^2,\ldots,y_i^{size_i}], and yi{1,2,,C}y_i\in\{1,2,\ldots,C\}.

You have already seen that an RNN can be used to label a sequence. At each time step tt, it reads an input feature vector 𝐱i(t)\mathbf{x}_i^{(t)}, and the last recurrent layer outputs a label ylast(t)y_{last}^{(t)} (in the case of binary labeling) or 𝐲last(t)\mathbf{y}_{last}^{(t)} (in the case of multiclass or multilabel labeling).

However, RNN is not the only possible model for sequence labeling. The model called Conditional Random Fields (CRF) is a very effective alternative that often performs well in practice for the feature vectors that have many informative features. For example, imagine we have the task of named entity extraction and we want to build a model that would label each word in the sentence such as “I go to San Francisco” with one of the following classes: {location,name,company_name,other}\{location, name, company\_name, other\}. If our feature vectors (which represent words) contain such binary features as “whether or not the word starts with a capital letter” and “whether or not the word can be found in the list of locations,” such features would be very informative and help to classify the words San and Francisco as locationlocation.

Building handcrafted features is known to be a labor-intensive process that requires a significant level of domain expertise.

CRF is an interesting model and can be seen as a generalization of logistic regression to sequences. However, in practice, for sequence labeling tasks, it has been outperformed by bidirectional deep gated RNNs. CRFs are also significantly slower in training, which makes them difficult to apply to large training sets (with hundreds of thousands of examples). Additionally, a large training set is where a deep neural network thrives.

7.7 Sequence-to-Sequence Learning

Sequence-to-sequence learning (often abbreviated as seq2seq learning) is a generalization of the sequence labeling problem. In seq2seq, XiX_i and YiY_i can have different lengths. seq2seq models have found application in machine translation (where, for example, the input is an English sentence, and the output is the corresponding French sentence), conversational interfaces (where the input is a question typed by the user, and the output is the answer from the machine), text summarization, spelling correction, and many others.

Many but not all seq2seq learning problems are currently best solved by neural networks. The network architectures used in seq2seq all have two parts: an encoder and a decoder.

In seq2seq neural network learning, the encoder is a neural network that accepts sequential input. It can be an RNN, but also a CNN or some other architecture. The role of the encoder is to read the input and generate some sort of state (similar to the state in RNN) that can be seen as a numerical representation of the meaning of the input the machine can work with. The meaning of some entity, whether it be an image, a text or a video, is usually a vector or a matrix that contains real numbers. In machine learning jargon, this vector (or matrix) is called the embedding of the input.

The decoder is another neural network that takes an embedding as input and is capable of generating a sequence of outputs. As you could have already guessed, that embedding comes from the encoder. To produce a sequence of outputs, the decoder takes a start of sequence input feature vector 𝐱(0)\mathbf{x}^{(0)} (typically all zeroes), produces the first output 𝐲(1)\mathbf{y}^{(1)}, updates its state by combining the embedding and the input 𝐱(0)\mathbf{x}^{(0)}, and then uses the output 𝐲(1)\mathbf{y}^{(1)} as its next input 𝐱(1)\mathbf{x}^{(1)}. For simplicity, the dimensionality of 𝐲(t)\mathbf{y}^{(t)} can be the same as that of 𝐱(t)\mathbf{x}^{(t)}; however, it is not strictly necessary. As we saw in Chapter 6, each layer of an RNN can produce many simultaneous outputs: one can be used to generate the label 𝐲(t)\mathbf{y}^{(t)}, while another one, of different dimensionality, can be used as the 𝐱(t)\mathbf{x}^{(t)}.
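The decoder’s autoregressive loop described above can be sketched schematically. The step function below is a toy stand-in for a trained decoder cell (its interface and arithmetic are illustrative assumptions, not the book’s notation); the point is only the feedback of each output as the next input:

```python
import numpy as np

def decode(embedding, step, x0, max_steps):
    """Autoregressive decoding: the decoder starts from the encoder's
    embedding, consumes a start-of-sequence input x0, and feeds each
    output y^(t) back in as the next input x^(t)."""
    state, x = embedding, x0
    outputs = []
    for _ in range(max_steps):
        state, y = step(state, x)  # one decoder step: (state, input) -> (state, output)
        outputs.append(y)
        x = y                      # the output becomes the next input
    return outputs

def toy_step(state, x):
    """Stand-in for a trained decoder cell: accumulate, emit the state sum."""
    new_state = state + x
    return new_state, float(new_state.sum())

embedding = np.array([1.0, 2.0])   # pretend this came from the encoder
outputs = decode(embedding, toy_step, x0=np.zeros(2), max_steps=3)
```

In a real seq2seq model, step would be the trained RNN (or other) cell and decoding would stop at an end-of-sequence token rather than after a fixed number of steps.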

Both encoder and decoder are trained simultaneously using the training data. The errors at the decoder output are propagated to the encoder via backpropagation.

A traditional seq2seq architecture is illustrated below:

Figure 37: A traditional seq2seq architecture. The embedding, usually given by the state of the last layer of the encoder, is passed from the blue subnetwork to the purple one.

More accurate predictions can be obtained using an architecture with attention. The attention mechanism is implemented by an additional set of parameters that combine some information from the encoder (in RNNs, this information is the list of state vectors of the last recurrent layer from all encoder time steps) and the current state of the decoder to generate the label. That allows for even better retention of long-term dependencies than provided by gated units and bidirectional RNNs.

A seq2seq architecture with attention is illustrated below:

Figure 38: A seq2seq architecture with attention.

seq2seq learning is a relatively new research domain. Novel network architectures are regularly discovered and published. Training such architectures can be challenging, as the number of hyperparameters to tune and other architectural decisions can be overwhelming. Consult the book’s wiki for state-of-the-art material, tutorials, and code samples.

7.8 Active Learning

Active learning is an interesting supervised learning paradigm. It is usually applied when obtaining labeled examples is costly. That is often the case in the medical or financial domains, where the opinion of an expert may be required to annotate patients’ or customers’ data. The idea is to start learning with relatively few labeled examples, and a large number of unlabeled ones, and then label only those examples that contribute the most to the model quality.

There are multiple strategies of active learning. Here, we discuss only the following two:

  1. data density and uncertainty based, and
  2. support vector-based.

The former strategy applies the current model ff, trained using the existing labeled examples, to each of the remaining unlabeled examples (or, to save computing time, to some random sample of them). For each unlabeled example 𝐱\mathbf{x}, the following importance score is computed: density(𝐱)uncertaintyf(𝐱)density(\mathbf{x})\cdot uncertainty_f(\mathbf{x}). Density reflects how many examples surround 𝐱\mathbf{x} in its close neighborhood, while uncertaintyf(𝐱)uncertainty_f(\mathbf{x}) reflects how uncertain the prediction of the model ff is for 𝐱\mathbf{x}. In binary classification with sigmoid, the closer the prediction score is to 0.50.5, the more uncertain is the prediction. In SVM, the closer the example is to the decision boundary, the more uncertain is the prediction.

In multiclass classification, entropy can be used as a typical measure of uncertainty:

Hf(𝐱)=c=1CPr(y(c);f(𝐱))ln[Pr(y(c);f(𝐱))], \begin{split}\mathrm{H}_f(\mathbf{x})=-\sum_{c=1}^{C}\Pr(y^{(c)}; f(\mathbf{x}))\\ \cdot\ln \left[\Pr(y^{(c)}; f(\mathbf{x}))\right],\end{split}

where Pr(y(c);f(𝐱))\Pr(y^{(c)}; f(\mathbf{x})) is the probability score the model ff assigns to class y(c)y^{(c)} when classifying 𝐱\mathbf{x}. You can see that if for each y(c)y^{(c)}, f(y(c))=1Cf(y^{(c)}) = \frac{1}{C}, then the model is the most uncertain and the entropy is at its maximum of lnC\ln C; on the other hand, if for some y(c)y^{(c)}, f(y(c))=1f(y^{(c)}) = 1, then the model is certain about the class y(c)y^{(c)} and the entropy is at its minimum of 00.

Density for the example 𝐱\mathbf{x} can be obtained, for instance, as the inverse of the average distance from 𝐱\mathbf{x} to each of its kk nearest neighbors (with kk being a hyperparameter): the smaller that average distance, the denser the neighborhood.

Once we know the importance score of each unlabeled example, we pick the one with the highest importance score and ask the expert to annotate it. Then we add the new annotated example to the training set, rebuild the model and continue the process until some stopping criterion is satisfied. A stopping criterion can be chosen in advance (the maximum number of requests to the expert based on the available budget) or depend on how well our model performs according to some metric.
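The density-times-uncertainty score can be sketched in Python. Entropy follows the equation above; the density measure is implemented here as the inverse of the average distance to the kk nearest neighbors, which is one reasonable reading of the text (an assumption), and the toy pool and probabilities are made up for illustration:

```python
import numpy as np

def entropy(p):
    """Uncertainty of a predicted class distribution p (entries sum to 1)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def density(x, X, k):
    """Inverse of the average distance from x to its k nearest neighbors."""
    d = np.sort(np.linalg.norm(X - x, axis=1))[1:k + 1]  # skip x itself
    return 1.0 / d.mean()

def importance(x, p, X, k):
    return density(x, X, k) * entropy(p)

# Toy pool of four 1-D examples with model-predicted class probabilities.
X = np.array([[0.0], [0.1], [0.2], [5.0]])
probs = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01], [0.5, 0.5]]
scores = [importance(X[i], probs[i], X, k=2) for i in range(len(X))]
query = int(np.argmax(scores))  # the example to send to the expert
```

Here the first example wins: it is maximally uncertain (0.5/0.5) and sits in a dense region, while the equally uncertain outlier at 5.0 scores low.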

The support vector-based active learning strategy consists of building an SVM model using the labeled data. We then ask our expert to annotate the unlabeled example that lies closest to the hyperplane separating the two classes. The idea is that if the example lies closest to the hyperplane, then it is the least certain and would contribute the most to the reduction of possible places where the true hyperplane (the one we look for) could lie.

Some active learning strategies can incorporate the cost of asking an expert for a label. Others learn when to ask for an expert’s opinion. The “query by committee” strategy consists of training multiple models using different methods and then asking an expert to label the examples on which those models disagree the most. Some strategies try to select examples to label so that the variance or the bias of the model is reduced the most.

7.9 Semi-Supervised Learning

半监督学习(SSL)中,我们还标记了数据集的一小部分;其余大部分示例均未标记。我们的目标是利用大量未标记的示例来提高模型性能,而不需要额外的标记示例。

In semi-supervised learning (SSL), only a small fraction of the dataset is labeled; most of the remaining examples are unlabeled. Our goal is to leverage the large number of unlabeled examples to improve the model performance without asking for additional labeled examples.

Historically, there were multiple attempts at solving this problem. None of them could be called universally acclaimed or frequently used in practice. For example, one frequently cited SSL method is called self-learning. In self-learning, we use a learning algorithm to build the initial model using the labeled examples. Then we apply the model to all unlabeled examples and label them using the model. If the confidence score of the prediction for some unlabeled example 𝐱\mathbf{x} is higher than some threshold (chosen experimentally), then we add this labeled example to our training set, retrain the model, and continue like this until a stopping criterion is satisfied. We could stop, for example, if the accuracy of the model has not improved during the last mm iterations.
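The self-learning loop can be sketched as below. The fit interface returning a model with a predict_proba method, and the toy midpoint classifier, are assumptions made purely for illustration:

```python
import numpy as np

def self_learning(fit, X_lab, y_lab, X_unlab, threshold, max_rounds):
    """Self-learning: train, pseudo-label confident unlabeled examples,
    add them to the training set, and repeat until nothing clears the
    confidence threshold (or max_rounds is reached)."""
    pool = X_unlab.copy()
    for _ in range(max_rounds):
        model = fit(X_lab, y_lab)
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)        # shape (n_pool, C)
        conf, labels = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= threshold                 # confident pseudo-labels only
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, pool[keep]])
        y_lab = np.concatenate([y_lab, labels[keep]])
        pool = pool[~keep]
    return model, X_lab, y_lab

class MidpointModel:
    """Toy binary classifier: sigmoid around the midpoint of class means."""
    def __init__(self, X, y):
        self.t = (X[y == 0].mean() + X[y == 1].mean()) / 2
    def predict_proba(self, X):
        p1 = 1.0 / (1.0 + np.exp(-(X[:, 0] - self.t)))
        return np.column_stack([1 - p1, p1])

X_lab = np.array([[0.0], [10.0]])
y_lab = np.array([0, 1])
X_unlab = np.array([[1.0], [9.0], [5.0]])  # 5.0 is never confident enough
model, X_all, y_all = self_learning(MidpointModel, X_lab, y_lab, X_unlab,
                                    threshold=0.95, max_rounds=5)
```

The two clearly separated points get pseudo-labeled in the first round; the ambiguous point at 5.0 never clears the threshold, so the loop stops.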

The above method can bring some improvement to the model compared to just using the initially labeled dataset, but the increase in performance usually is not impressive. Furthermore, in practice, the quality of the model could even decrease. That depends on the properties of the statistical distribution the data was drawn from, which is usually unknown.

On the other hand, the recent advancements in neural network learning brought some impressive results. For example, it was shown that for some datasets, such as MNIST (a frequent testbench in computer vision that consists of labeled images of handwritten digits from 0 to 9) the model trained in a semi-supervised way has an almost perfect performance with just 10 labeled examples per class (100 labeled examples overall). For comparison, MNIST contains 70,000 labeled examples (60,000 for training and 10,000 for test). The neural network architecture that attained such a remarkable performance is called a ladder network. To understand ladder networks you have to understand what an autoencoder is.

An autoencoder is a feed-forward neural network with an encoder-decoder architecture. It is trained to reconstruct its input. So the training example is a pair (𝐱,𝐱)(\mathbf{x},\mathbf{x}). We want the output 𝐱̂\hat{\mathbf{x}} of the model f(𝐱)f(\mathbf{x}) to be as similar to the input 𝐱\mathbf{x} as possible.

An important detail here is that an autoencoder’s network looks like an hourglass with a bottleneck layer in the middle that contains the embedding of the DD-dimensional input vector; the embedding layer usually has far fewer units than DD. The goal of the decoder is to reconstruct the input feature vector from this embedding. Theoretically, it is sufficient to have 1010 units in the bottleneck layer to successfully encode MNIST images. In a typical autoencoder, schematically depicted in fig. 39, the cost function is usually either the mean squared error (when features can be any number) or the binary cross-entropy (when features are binary and the units of the last layer of the decoder have the sigmoid activation function). If the cost is the mean squared error, then it is given by:

1Ni=1N𝐱if(𝐱i)2, \frac{1}{N}\sum_{i=1}^N \|\mathbf{x}_i - f(\mathbf{x}_i)\|^2,

where 𝐱if(𝐱i)\|\mathbf{x}_i - f(\mathbf{x}_i)\| is the Euclidean distance between two vectors.

Figure 39: Autoencoder.

A denoising autoencoder corrupts the left-hand side 𝐱\mathbf{x} in the training example (𝐱,𝐱)(\mathbf{x}, \mathbf{x}) by adding some random perturbation to the features. If our examples are grayscale images with pixels represented as values between 00 and 11, Gaussian noise is usually added to each feature. For each feature jj of the input feature vector 𝐱\mathbf{x}, the noise value n(j)n^{(j)} is sampled from the Gaussian distribution:

n(j)𝒩(μ,σ2). n^{(j)} \sim \mathcal{N}(\mu, \sigma^{2}).

where the notation \sim means “sampled from,” and 𝒩(μ,σ2)\mathcal{N}(\mu, \sigma^{2}) denotes the Gaussian distribution with mean μ\mu and standard deviation σ\sigma whose pdf is given by:

f𝛉(z)=1σ2πexp((zμ)22σ2). f_{\bm{\theta}}(z) = \frac{1}{\sigma{\sqrt{2\pi}}} \exp \left(-\frac{(z-\mu)^{2}}{2\sigma^{2}}\right).

In the above equation, π\pi is the constant and 𝛉=def[μ,σ]\bm{\theta} \stackrel{\text{def}}{=} [\mu,\sigma] is a hyperparameter. The new, corrupted value of the feature x(j)x^{(j)} is given by x(j)+n(j)x^{(j)} + n^{(j)}.
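The corruption step can be sketched in a few lines. Clipping the corrupted pixels back to [0,1][0, 1] is a common extra step for image data, an assumption here rather than something the text prescribes:

```python
import numpy as np

def corrupt(x, mu=0.0, sigma=0.1, rng=None):
    """Add Gaussian noise n^(j) ~ N(mu, sigma^2) to every feature of x,
    then clip back to the valid pixel range [0, 1]."""
    rng = rng if rng is not None else np.random.default_rng()
    n = rng.normal(mu, sigma, size=x.shape)
    return np.clip(x + n, 0.0, 1.0)

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 5)          # a tiny "grayscale image"
x_noisy = corrupt(x, sigma=0.05, rng=rng)
```

The autoencoder is then trained on pairs (x_noisy, x), learning to reconstruct the clean input from its corrupted version.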

A ladder network is a denoising autoencoder with an upgrade. The encoder and the decoder have the same number of layers. The bottleneck layer is used directly to predict the label (using the softmax activation function). The network has several cost functions. For each layer $l$ of the encoder and the corresponding layer $l$ of the decoder, one cost $C_d^l$ penalizes the difference between the outputs of the two layers (using the squared Euclidean distance). When a labeled example is used during training, another cost function, $C_c$, penalizes the error in the prediction of the label (the negative log-likelihood cost function is used). The combined cost function, $C_c + \sum_{l=1}^L \lambda_l C_d^l$ (averaged over all examples in the batch), is optimized by minibatch stochastic gradient descent with backpropagation. The hyperparameters $\lambda_l$ for each layer $l$ determine the tradeoff between the classification and the encoding-decoding cost.

In the ladder network, not only the input but also the output of each encoder layer is corrupted with noise during training. When we apply the trained model to a new input $\mathbf{x}$ to predict its label, we do not corrupt the input.

Other semi-supervised learning techniques, not related to training neural networks, exist. One of them involves building the model using the labeled data and then clustering the unlabeled and labeled examples together using any clustering technique (we consider some of them in Chapter 9). For each new example, we then output as a prediction the majority label of the cluster it belongs to.

Another technique, called S3VM, is based on using SVM. We build one SVM model for each possible labeling of the unlabeled examples and then pick the model with the largest margin. The paper on S3VM describes an approach that allows solving this problem without actually enumerating all possible labelings.

7.10 One-Shot Learning

This chapter would be incomplete without mentioning two other important supervised learning paradigms. One of them is one-shot learning. In one-shot learning, typically applied in face recognition, we want to build a model that can recognize that two photos of the same person represent that same person. If we present to the model two photos of two different people, we expect the model to recognize that the two people are different.

To solve such a problem, we could go a traditional way and build a binary classifier that takes two images as input and predicts either true (when the two pictures represent the same person) or false (when the two pictures belong to different people). However, in practice, this would result in a neural network twice as big as a typical neural network, because each of the two pictures needs its own embedding subnetwork. Training such a network would be challenging not only because of its size but also because the positive examples would be much harder to obtain than negative ones. So the problem is highly imbalanced.

One way to effectively solve the problem is to train a siamese neural network (SNN). An SNN can be implemented as any kind of neural network: a CNN, an RNN, or an MLP. The network takes only one image as input at a time, so the size of the network is not doubled. To obtain a binary classifier "same_person"/"not_same" out of a network that only takes one picture as input, we train the network in a special way.

To train an SNN, we use the triplet loss function. For example, let us have three images of a face: image $A$ (for anchor), image $P$ (for positive) and image $N$ (for negative). $A$ and $P$ are two different pictures of the same person; $N$ is a picture of another person. Each training example $i$ is now a triplet $(A_i, P_i, N_i)$.

Let's say we have a neural network model $f$ that can take a picture of a face as input and output an embedding of this picture. The triplet loss for example $i$ is defined as,

$$\max(\|f(A_i)-f(P_i)\|^2 - \|f(A_i)-f(N_i)\|^2 + \alpha, 0). \qquad (23)$$

The cost function is defined as the average triplet loss:

$$\frac{1}{N}\sum_{i=1}^N \max(\|f(A_i)-f(P_i)\|^2 - \|f(A_i)-f(N_i)\|^2 + \alpha, 0),$$

where $\alpha$ is a positive hyperparameter. Intuitively, $\|f(A_i)-f(P_i)\|^2$ is low when our neural network outputs similar embedding vectors for $A_i$ and $P_i$; $\|f(A_i)-f(N_i)\|^2$ is high when the embeddings for pictures of two different people are different. If our model works the way we want, then the term $m = \|f(A_i)-f(P_i)\|^2 - \|f(A_i)-f(N_i)\|^2$ will always be negative, because we subtract a high value from a small value. By setting $\alpha$ higher, we force the term $m$ to be even smaller, to make sure that the model learns to recognize two pictures of the same face and two pictures of different faces with a high margin. If $m$ is not small enough, then because of $\alpha$ the cost will be positive, and the model parameters will be adjusted in backpropagation.
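The triplet loss of eq. 23 is easy to compute once the embeddings are given. Here is a minimal NumPy sketch in which the tiny 2-dimensional "embeddings" stand in for real network outputs:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss of eq. 23 for one (anchor, positive, negative) triple
    of embedding vectors."""
    d_pos = np.sum((f_a - f_p) ** 2)   # squared distance anchor-positive
    d_neg = np.sum((f_a - f_n) ** 2)   # squared distance anchor-negative
    return max(d_pos - d_neg + alpha, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor: same person
n = np.array([1.0, 1.0])   # far from the anchor: different person

easy = triplet_loss(a, p, n)   # margin satisfied, the loss is zero
hard = triplet_loss(a, n, p)   # roles swapped, the loss is positive
```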

Rather than randomly choosing an image for $N$, a better way to create triplets for training is to use the current model after several epochs of learning and find candidates for $N$ that are similar to $A$ and $P$ according to that model. Using random examples as $N$ would significantly slow down the training, because the neural network will easily see the difference between pictures of two random people, so the average triplet loss will be low most of the time and the parameters will not be updated fast enough.

To build an SNN, we first decide on the architecture of our neural network. For example, a CNN is a typical choice if our inputs are images. Given an example, to calculate the average triplet loss, we apply the model consecutively to $A$, then to $P$, then to $N$, and then we compute the loss for that example using eq. 23. We repeat that for all triplets in the batch and then compute the cost; gradient descent with backpropagation propagates the cost through the network to update its parameters.

It's a common misconception that for one-shot learning we need only one example of each entity for training. In practice, we need more than one example of each person for the person identification model to be accurate. It's called one-shot because of the most frequent application of such a model: face-based authentication. For example, such a model could be used to unlock your phone. If your model is good, then you only need one picture of you on your phone, and it will recognize you; it will also recognize that someone else is not you. When we have the model, to decide whether two pictures $A$ and $\hat{A}$ belong to the same person, we check whether $\|f(A)-f(\hat{A})\|^2$ is less than $\tau$, a hyperparameter.

7.11 Zero-Shot Learning

I finish this chapter with zero-shot learning. It is a relatively new research area, so no algorithms have yet proven to have significant practical utility. Therefore, I only outline here the basic idea and leave the details of the various algorithms for further reading. In zero-shot learning (ZSL), we want to train a model to assign labels to objects. The most frequent application is learning to assign labels to images.

However, contrary to standard classification, we want the model to be able to predict labels that we didn’t have in the training data. How is that possible?

The trick is to use embeddings not just to represent the input $\mathbf{x}$ but also to represent the output $y$. Imagine that we have a model that, for any word in English, can generate an embedding vector with the following property: if a word $y_i$ has a similar meaning to the word $y_k$, then the embedding vectors for these two words will be similar. For example, if $y_i$ is Paris and $y_k$ is Rome, then they will have embeddings that are similar; on the other hand, if $y_k$ is potato, then the embeddings of $y_i$ and $y_k$ will be dissimilar. Such embedding vectors are called word embeddings, and they are usually compared using the cosine similarity metric¹.

Word embeddings have such a property that each dimension of the embedding represents a specific feature of the meaning of the word. For example, if our word embedding has four dimensions (usually they are much wider, between 50 and 300 dimensions), then these four dimensions could represent such features of the meaning as animalness, abstractness, sourness, and yellowness (yes, sounds funny, but it's just an example). So the word bee would have an embedding like $[1,0,0,1]$, the word yellow like $[0,1,0,1]$, and the word unicorn like $[1,1,0,0]$. The values for each embedding are obtained using a specific training procedure applied to a vast text corpus.

Now, in our classification problem, we can replace the label $y_i$ for each example $i$ in our training set with its word embedding and train a multi-label model that predicts word embeddings. To get the label for a new example $\mathbf{x}$, we apply our model $f$ to $\mathbf{x}$, get the embedding $\hat{\mathbf{y}}$, and then search among all English words for those whose embeddings are the most similar to $\hat{\mathbf{y}}$ using cosine similarity.
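A minimal sketch of this nearest-embedding lookup, using the toy 4-dimensional embeddings from the example above (the vectors, vocabulary, and function names are illustrative assumptions):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def closest_word(predicted, vocab):
    """Return the vocabulary word whose embedding is most
    cosine-similar to the embedding predicted by the model."""
    return max(vocab, key=lambda w: cosine(predicted, vocab[w]))

# Toy 4-dimensional "word embeddings" in the spirit of the example above.
vocab = {
    "bee":     np.array([1.0, 0.0, 0.0, 1.0]),
    "yellow":  np.array([0.0, 1.0, 0.0, 1.0]),
    "unicorn": np.array([1.0, 1.0, 0.0, 0.0]),
}
y_hat = np.array([0.9, 0.1, 0.0, 0.8])   # model output for a new image
label = closest_word(y_hat, vocab)
```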

Why does that work? Take a zebra, for example. It is white, it is a mammal, and it has stripes. Take a clownfish: it is orange, not a mammal, and has stripes. Now take a tiger: it is orange, it has stripes, and it is a mammal. If these three features are present in word embeddings, the CNN would learn to detect these same features in pictures. Even if the label tiger was absent from the training data, but other objects including zebras and clownfish were present, the CNN will most likely learn the notions of mammalness, orangeness, and stripeness to predict the labels of those objects. Once we present a picture of a tiger to the model, those features will be correctly identified from the image, and most likely the closest word embedding from our English dictionary to the predicted embedding will be that of tiger.


  1. I will show in Chapter 10 how to learn word embeddings from data.

8 Advanced Practice

This chapter contains the description of techniques that you could find useful in your practice in some contexts. It’s called “Advanced Practice” not because the presented techniques are more complex, but rather because they are applied in some very specific contexts. In many practical situations, you will most likely not need to resort to using these techniques, but sometimes they are very helpful.

8.1 Handling Imbalanced Datasets

Often in practice, examples of some class will be underrepresented in your training data. This is the case, for example, when your classifier has to distinguish between genuine and fraudulent e-commerce transactions: the examples of genuine transactions are much more frequent. If you use SVM with soft margin, you can define a cost for misclassified examples. Because noise is always present in the training data, there are high chances that many examples of genuine transactions would end up on the wrong side of the decision boundary by contributing to the cost.

The SVM algorithm tries to move the hyperplane to avoid misclassified examples as much as possible. The “fraudulent” examples, which are in the minority, risk being misclassified in order to classify more numerous examples of the majority class correctly. This situation is illustrated below:

Figure 40: An illustration of an imbalanced problem: the two classes have the same weight.

This problem is observed for most learning algorithms applied to imbalanced datasets.

If you set the cost of misclassification of examples of the minority class higher, then the model will try harder to avoid misclassifying those examples, but this will incur the cost of misclassification of some examples of the majority class, as illustrated below:

Figure 41: An illustration of an imbalanced problem: examples of the minority class have a higher weight.

Some SVM implementations allow you to provide weights for every class. The learning algorithm takes this information into account when looking for the best hyperplane.

If a learning algorithm doesn’t allow weighting classes, you can try the technique of oversampling. It consists of increasing the importance of examples of some class by making multiple copies of the examples of that class.

An opposite approach, undersampling, is to randomly remove from the training set some examples of the majority class.

You might also try to create synthetic examples by randomly sampling feature values of several examples of the minority class and combining them to obtain a new example of that class. There are two popular algorithms that oversample the minority class by creating synthetic examples: the synthetic minority oversampling technique (SMOTE) and the adaptive synthetic sampling method (ADASYN).

SMOTE and ADASYN work similarly in many ways. For a given example $\mathbf{x}_i$ of the minority class, they pick $k$ nearest neighbors of this example (let's denote this set of $k$ examples $\mathcal{S}_k$) and then create a synthetic example $\mathbf{x}_{new}$ as $\mathbf{x}_i + \lambda (\mathbf{x}_{zi} - \mathbf{x}_{i})$, where $\mathbf{x}_{zi}$ is an example of the minority class chosen randomly from $\mathcal{S}_k$. The interpolation hyperparameter $\lambda$ is a random number in the range $[0,1]$.

Both SMOTE and ADASYN randomly pick all possible $\mathbf{x}_i$ in the dataset. In ADASYN, the number of synthetic examples generated for each $\mathbf{x}_i$ is proportional to the number of examples in $\mathcal{S}_k$ which are not from the minority class. Therefore, more synthetic examples are generated in the areas where the examples of the minority class are rare.
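The interpolation step shared by SMOTE and ADASYN can be sketched as follows; the function name and the toy neighbor set are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_example(x_i, neighbors):
    """SMOTE-style synthetic minority example:
    x_new = x_i + lambda * (x_zi - x_i), where x_zi is a randomly chosen
    neighbor and lambda is drawn uniformly from [0, 1]."""
    x_zi = neighbors[rng.integers(len(neighbors))]
    lam = rng.uniform(0.0, 1.0)
    return x_i + lam * (x_zi - x_i)

x_i = np.array([0.0, 0.0])                  # a minority-class example
S_k = np.array([[1.0, 0.0], [0.0, 1.0]])    # its k = 2 nearest minority neighbors
x_new = smote_example(x_i, S_k)             # lies on a segment to one neighbor
```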

Some algorithms are less sensitive to the problem of an imbalanced dataset. Decision trees, as well as random forest and gradient boosting, often perform well on imbalanced datasets.

8.2 Combining Models

Ensemble algorithms, like Random Forest, typically combine models of the same nature. They boost performance by combining hundreds of weak models. In practice, we can sometimes get an additional performance gain by combining strong models made with different learning algorithms. In this case, we usually use only two or three models.

Three typical ways to combine models are 1) averaging, 2) majority vote and 3) stacking.

Averaging works for regression as well as those classification models that return classification scores. You simply apply all your models (let's call them base models) to the input $\mathbf{x}$ and then average the predictions. To see if the averaged model works better than each individual algorithm, you test it on the validation set using a metric of your choice.

Majority vote works for classification models. You apply all your base models to the input $\mathbf{x}$ and then return the majority class among all predictions. In the case of a tie, you either randomly pick one of the classes or return an error message (if misclassification would incur a significant cost).
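A sketch of majority voting with the tie detection described above (the function name and labels are illustrative):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine base-model class predictions.
    Returns (winning_label, is_tie); is_tie signals that the caller may
    want to pick randomly or return an error instead."""
    counts = Counter(predictions).most_common()
    top_count = counts[0][1]
    tied = [label for label, c in counts if c == top_count]
    return tied[0], len(tied) > 1

label, tie = majority_vote(["cat", "dog", "cat"])   # clear majority
```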

Stacking consists of building a meta-model that takes the output of base models as input. Let's say you want to combine classifiers $f_1$ and $f_2$, both predicting the same set of classes. To create a training example $(\hat{\mathbf{x}}_i, \hat{y}_i)$ for the stacked model, set $\hat{\mathbf{x}}_i = [f_1(\mathbf{x}_i), f_2(\mathbf{x}_i)]$ and $\hat{y}_i = y_i$.

If some of your base models return not just a class, but also a score for each class, you can use these values as features too.
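Building the meta-model's training inputs can be sketched like this; the two lambdas are toy callables standing in for trained base models:

```python
import numpy as np

def stacked_features(X, base_models):
    """Meta-model inputs: for each example (row of X), the concatenated
    outputs of all base models."""
    return np.column_stack([m(X) for m in base_models])

# Two toy "base classifiers" that each return one score per example.
f1 = lambda X: (X[:, 0] > 0).astype(float)
f2 = lambda X: (X[:, 1] > 0).astype(float)

X = np.array([[1.0, -1.0], [-2.0, 3.0]])
X_meta = stacked_features(X, [f1, f2])   # inputs for the stacked model
```

The meta-model is then trained on `X_meta` against the original labels.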

To train the stacked model, it is recommended to use examples from the training set and tune the hyperparameters of the stacked model using cross-validation.

Obviously, you have to make sure that your stacked model performs better on the validation set than each of the base models you stacked.

The reason that combining multiple models can bring better performance is that when several uncorrelated strong models agree they are more likely to agree on the correct outcome. The keyword here is “uncorrelated.” Ideally, base models should be obtained using different features or using algorithms of a different nature — for example, SVMs and Random Forest. Combining different versions of the decision tree learning algorithm, or several SVMs with different hyperparameters, may not result in a significant performance boost.

8.3 Training Neural Networks

In neural network training, one challenging aspect is how to convert your data into input the network can work with. If your input is images, first of all, you have to resize all images so that they have the same dimensions. After that, pixels are usually first standardized and then normalized to the range $[0,1]$.
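A sketch of that preprocessing order for a single image, assuming raw pixel values in $[0, 255]$ (the function name is illustrative):

```python
import numpy as np

def preprocess_pixels(img):
    """Standardize pixels (zero mean, unit variance), then min-max
    normalize the result into the range [0, 1]."""
    z = (img - img.mean()) / img.std()           # standardization
    return (z - z.min()) / (z.max() - z.min())   # normalization to [0, 1]

img = np.array([[0.0, 64.0], [128.0, 255.0]])    # raw grayscale pixels
out = preprocess_pixels(img)
```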

Texts have to be tokenized (that is, split into pieces, such as words, punctuation marks, and other symbols). For CNN and RNN, each token is converted into a vector using the one-hot encoding, so the text becomes a list of one-hot vectors. Another, often better way to represent tokens is by using word embeddings. For a multilayer perceptron, to convert texts to vectors the bag of words approach may work well, especially for larger texts (larger than SMS messages and tweets).

The choice of specific neural network architecture is a difficult one. For the same problem, like seq2seq learning, there is a variety of architectures, and new ones are proposed almost every year. I recommend researching state of the art solutions for your problem using Google Scholar or Microsoft Academic search engines that allow searching for scientific publications using keywords and time range. If you don’t mind working with less modern architecture, I recommend looking for implemented architectures on GitHub and finding one that could be applied to your data with minor modifications.

In practice, the advantage of a modern architecture over an older one becomes less significant as you preprocess, clean and normalize your data, and create a larger training set. Modern neural network architectures are a result of the collaboration of scientists from several labs and companies; such models could be very complex to implement on your own and usually require much computational power to train. Time spent trying to replicate results from a recent scientific paper may not be worth it. This time could better be spent on building the solution around a less modern but stable model and getting more training data.

Once you have decided on the architecture of your network, you have to decide on the number of layers, their type, and their size. It is recommended to start with one or two layers, train a model, and see if it fits the training data well (has a low bias). If not, gradually increase the size of each layer and the number of layers until the model perfectly fits the training data. Once this is the case, if the model doesn't perform well on the validation data (has a high variance), you should add regularization to your model. If, after adding regularization, the model doesn't fit the training data anymore, slightly increase the size of the network. Continue iteratively until the model fits both the training and validation data well enough according to your metric.

8.4 Advanced Regularization

In neural networks, besides L1 and L2 regularization, you can use neural network specific regularizers: dropout, early stopping, and batch normalization. The latter is technically not a regularization technique, but it often has a regularization effect on the model.

The concept of dropout is very simple. Each time you run a training example through the network, you temporarily exclude some units from the computation at random. The higher the percentage of units excluded, the higher the regularization effect. Neural network libraries allow you to add a dropout layer between two successive layers, or you can specify the dropout parameter for a layer. The dropout parameter is in the range $[0,1]$, and it has to be found experimentally by tuning it on the validation data.
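A sketch of dropout applied to one layer's activations. The rescaling by $1/(1-\text{rate})$ is the common "inverted dropout" implementation detail, an assumption not spelled out in the text; it keeps the expected activation unchanged so nothing needs rescaling at prediction time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5):
    """Inverted dropout: zero out a random fraction `rate` of units and
    rescale the survivors so the expected activation is unchanged."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

h = np.ones(10_000)                # a layer's activations, all equal to 1
h_dropped = dropout(h, rate=0.5)   # about half zeroed, survivors scaled to 2
```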

Early stopping is the way to train a neural network by saving the preliminary model after every epoch and assessing the performance of the preliminary model on the validation set. As you remember from the section about gradient descent in Chapter 4, as the number of epochs increases, the cost decreases. The decreased cost means that the model fits the training data well. However, at some point, after some epoch $e$, the model can start overfitting: the cost keeps decreasing, but the performance of the model on the validation data deteriorates. If you keep, in a file, the version of the model after each epoch, you can stop the training once you start observing a decreased performance on the validation set. Alternatively, you can keep running the training process for a fixed number of epochs and then, in the end, pick the best model. Models saved after each epoch are called checkpoints. Some machine learning practitioners rely on this technique very often; others try to properly regularize the model to avoid such an undesirable behavior.
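The checkpoint-selection logic can be sketched without any network at all; the "patience" rule below (stop after a fixed number of epochs without improvement) is one common stopping criterion, an assumption not spelled out in the text:

```python
def early_stopping(val_scores, patience=2):
    """Scan per-epoch validation scores.
    Returns (best_epoch, stopped_at): the checkpoint to keep and the
    epoch at which training would stop, after `patience` epochs with
    no improvement."""
    best, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores) - 1

# Validation accuracy per epoch: improves, then degrades (overfitting).
best_epoch, stopped_at = early_stopping([0.70, 0.80, 0.85, 0.84, 0.83, 0.82])
```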

Batch normalization (which rather has to be called batch standardization) is a technique that consists of standardizing the outputs of each layer before the units of the subsequent layer receive them as input. In practice, batch normalization results in faster and more stable training, as well as some regularization effect. So it’s always a good idea to try to use batch normalization. In neural network libraries, you can often insert a batch normalization layer between two layers.

Another regularization technique that can be applied not just to neural networks, but to virtually any learning algorithm, is called data augmentation. This technique is often used to regularize models that work with images. Once you have your original labeled training set, you can create a synthetic example from an original example by applying various transformations to the original image: zooming it slightly, rotating, flipping, darkening, and so on. You keep the original label in these synthetic examples. In practice, this often results in increased performance of the model.
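A sketch of label-preserving augmentation on a tiny "image"; the particular transformations chosen (two flips and a darkened copy) are illustrative:

```python
import numpy as np

def augment(img, label):
    """Create synthetic training examples from one labeled image:
    a horizontal flip, a vertical flip, and a slightly darkened copy,
    all keeping the original label."""
    return [
        (np.fliplr(img), label),
        (np.flipud(img), label),
        (np.clip(img * 0.8, 0.0, 1.0), label),
    ]

img = np.array([[0.0, 1.0], [0.5, 0.25]])   # pixel intensities in [0, 1]
synthetic = augment(img, "cat")             # three new (image, label) pairs
```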

8.5 Handling Multiple Inputs

Often in practice, you will work with multimodal data. For example, your input could be an image and text and the binary output could indicate whether the text describes this image.

It’s hard to adapt shallow learning algorithms to work with multimodal data. However, it’s not impossible. You could train one shallow model on the image and another one on the text. Then you can use a model combination technique we discussed above.

If you cannot divide your problem into two independent subproblems, you can try to vectorize each input (by applying the corresponding feature engineering method) and then simply concatenate the two feature vectors to form one wider feature vector. For example, if your image has features $[i^{(1)}, i^{(2)}, i^{(3)}]$ and your text has features $[t^{(1)}, t^{(2)}, t^{(3)}, t^{(4)}]$, your concatenated feature vector will be $[i^{(1)}, i^{(2)}, i^{(3)}, t^{(1)}, t^{(2)}, t^{(3)}, t^{(4)}]$.
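The concatenation itself is a one-liner (the feature values below are toy numbers):

```python
import numpy as np

image_features = np.array([1.0, 2.0, 3.0])        # [i1, i2, i3]
text_features = np.array([4.0, 5.0, 6.0, 7.0])    # [t1, t2, t3, t4]

# One wider feature vector to feed to a shallow learning algorithm.
combined = np.concatenate([image_features, text_features])
```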

With neural networks, you have more flexibility. You can build two subnetworks, one for each type of input. For example, a CNN subnetwork would read the image while an RNN subnetwork would read the text. Both subnetworks have as their last layer an embedding: CNN has an embedding of the image, while RNN has an embedding of the text. You can now concatenate two embeddings and then add a classification layer, such as softmax or sigmoid, on top of the concatenated embeddings. Neural network libraries provide simple-to-use tools that allow concatenating or averaging of layers from several subnetworks.

8.6 Handling Multiple Outputs

In some problems, you would like to predict multiple outputs for one input. We considered multi-label classification in the previous chapter. Some problems with multiple outputs can be effectively converted into a multi-label classification problem: especially those whose labels are of the same nature (like tags), or those for which fake labels can be created as a full enumeration of the combinations of the original labels.

However, in some cases the outputs are multimodal, and their combinations cannot be effectively enumerated. Consider the following example: you want to build a model that detects an object on an image and returns its coordinates. In addition, the model has to return a tag describing the object, such as “person,” “cat,” or “hamster.” Your training example will be a feature vector that represents an image. The label will be represented as a vector of coordinates of the object and another vector with a one-hot encoded tag.

To handle a situation like that, you can create one subnetwork that would work as an encoder. It will read the input image using, for example, one or several convolution layers. The encoder’s last layer would be the embedding of the image. Then you add two other subnetworks on top of the embedding layer: one that takes the embedding vector as input and predicts the coordinates of an object. This first subnetwork can have a ReLU as the last layer, which is a good choice for predicting positive real numbers, such as coordinates; this subnetwork could use the mean squared error cost C1C_1. The second subnetwork will take the same embedding vector as input and predict the probabilities for each label. This second subnetwork can have a softmax as the last layer, which is appropriate for the probabilistic output, and use the averaged negative log-likelihood cost C2C_2 (also called cross-entropy cost).

Obviously, you are interested in both accurately predicted coordinates and the label. However, it is impossible to optimize the two cost functions at the same time. By trying to optimize one, you risk hurting the second one and the other way around. What you can do is add another hyperparameter γ\gamma in the range (0,1)(0,1) and define the combined cost function as γC1+(1γ)C2\gamma C_1 + (1-\gamma) C_2. Then you tune the value for γ\gamma on the validation data just like any other hyperparameter.
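A minimal numeric sketch of the combined cost (the cost values and the γ grid below are made up; in a real setting, each γ means retraining and evaluating the model on the validation data):

```python
def combined_cost(c1, c2, gamma):
    """gamma * C1 + (1 - gamma) * C2, with gamma in (0, 1)."""
    return gamma * c1 + (1.0 - gamma) * c2

# Made-up validation costs of the two output heads:
c1, c2 = 0.8, 0.3

# Tune gamma like any other hyperparameter, e.g. by grid search:
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
costs = [combined_cost(c1, c2, g) for g in grid]
```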

8.7 Transfer Learning

Transfer learning is probably where neural networks have a unique advantage over the shallow models. In transfer learning, you pick an existing model trained on some dataset, and you adapt this model to predict examples from another dataset, different from the one the model was built on. This second dataset is not like holdout sets you use for validation and test. It may represent some other phenomenon, or, as machine learning scientists say, it may come from another statistical distribution.

For example, imagine you have trained your model to recognize (and label) wild animals on a big labeled dataset. After some time, you have another problem to solve: you need to build a model that would recognize domestic animals. With shallow learning algorithms, you do not have many options: you have to build another big labeled dataset, now for domestic animals.

With neural networks, the situation is much more favorable. Transfer learning in neural networks works like this:

  1. You build a deep model on the original big dataset (wild animals).
  2. You compile a much smaller labeled dataset for your second model (domestic animals).
  3. You remove the last one or several layers from the first model. Usually, these are layers responsible for the classification or regression; they usually follow the embedding layer.
  4. You replace the removed layers with new layers adapted for your new problem.
  5. You “freeze” the parameters of the layers remaining from the first model.
  6. You use your smaller labeled dataset and gradient descent to train the parameters of only the new layers.

Usually, there is an abundance of deep models for visual problems available online. You can find one that has high chances to be of use for your problem, download that model, remove several last layers (the quantity of layers to remove is a hyperparameter), add your own prediction layers and train your model.

Even if you don’t have an existing model, transfer learning can still help you in situations when your problem requires a labeled dataset that is very costly to obtain, but you can get another dataset for which labels are more readily available. Let’s say you build a document classification model. You got the taxonomy of labels from your employer, and it contains a thousand categories. In this case, you would need to pay someone to a) read, understand and memorize the differences between categories and b) read up to a million documents and annotate them.

To save on labeling so many examples, you could consider using Wikipedia pages as the dataset to build your first model. The labels for a Wikipedia page can be obtained automatically by taking the category the Wikipedia page belongs to. Once your first model has learned to predict Wikipedia categories, you can “fine tune” this model to predict the categories of your employer’s taxonomy. You will need much fewer annotated examples for your employer’s problem than you would need if you started solving your original problem from scratch.

8.8 Algorithmic Efficiency

Not all algorithms capable of solving a problem are practical. Some can be too slow. Some problems can be solved by a fast algorithm; for others, no fast algorithms can exist.

The subfield of computer science called analysis of algorithms is concerned with determining and comparing the complexity of algorithms. Big O notation is used to classify algorithms according to how their running time or space requirements grow as the input size grows.

For example, let’s say we have the problem of finding the two most distant one-dimensional examples in the set of examples 𝒮\mathcal{S} of size NN. One algorithm we could craft to solve this problem would look like this (here and below, in Python):
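The original listing is not preserved in this copy; a version consistent with the analysis in the next paragraph (one comparison, two abs and two assignment operations per inner iteration) could look like:

```python
def find_max_distance(S):
    """Return the two most distant one-dimensional examples in S (O(N^2))."""
    result = None
    max_distance = 0
    for x1 in S:                          # loop over all values in S
        for x2 in S:                      # for each x1, loop over all of S again
            if abs(x1 - x2) >= max_distance:
                max_distance = abs(x1 - x2)
                result = (x1, x2)
    return result

pair = find_max_distance([1, 5, 2, 9])
```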

In the above algorithm, we loop over all values in 𝒮\mathcal{S}, and at every iteration of the first loop, we loop over all values in 𝒮\mathcal{S} once again. Therefore, the above algorithm makes N2N^2 comparisons of numbers. If we take as a unit time the time the comparison\operatorname{comparison}, abs\operatorname{abs} and assignment\operatorname{assignment} operations take, then the time complexity (or, simply, complexity) of this algorithm is at most 5N25N^2. (At each iteration, we have one comparison\operatorname{comparison}, two abs\operatorname{abs} and two assignment\operatorname{assignment} operations.) When the complexity of an algorithm is measured in the worst case, big O notation is used. For the above algorithm, using big O notation, we write that the algorithm’s complexity is O(N2)O(N^2); the constants, like 55, are ignored.

For the same problem, we can craft another algorithm like this:
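Again, the listing is missing in this copy; an algorithm matching the description (a single pass that tracks the minimum and maximum, which are necessarily the two most distant one-dimensional points) could be:

```python
def find_max_distance(S):
    """Return the two most distant one-dimensional examples in S (O(N))."""
    min_x = max_x = S[0]
    for x in S:               # a single pass over S
        if x < min_x:
            min_x = x
        elif x > max_x:
            max_x = x
    return (min_x, max_x)     # the extremes are the two most distant points

pair = find_max_distance([1, 5, 2, 9])
```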

In the above algorithm, we loop over all values in 𝒮\mathcal{S} only once, so the algorithm’s complexity is O(N)O(N). In this case, we say that the latter algorithm is more efficient than the former.

An algorithm is called efficient when its complexity is polynomial in the size of the input. Therefore both O(N)O(N) and O(N2)O(N^2) are efficient because NN is a polynomial of degree 11 and N2N^2 is a polynomial of degree 22. However, for very large inputs, an O(N2)O(N^2) algorithm can be slow. In the big data era, scientists often look for O(logN)O(\log N) algorithms.

From a practical standpoint, when you implement your algorithm, you should avoid using loops whenever possible. For example, you should use operations on matrices and vectors, instead of loops. In Python, to compute 𝐰𝐱\mathbf{w}\mathbf{x}, you should write,
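The listing itself is not preserved here; with numpy (the vectors are made up), it would be along these lines:

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 4.0, 2.0])

wx = w.dot(x)   # vectorized dot product, computed in optimized native code
```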

and not,
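The discouraged loop-based version (also not preserved in this copy) would look roughly like:

```python
w = [2.0, -1.0, 0.5]
x = [1.0, 4.0, 2.0]

wx = 0
for i in range(len(w)):   # explicit Python loop: much slower than numpy
    wx += w[i] * x[i]
```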

Use appropriate data structures. If the order of elements in a collection doesn’t matter, use set\operatorname{set} instead of list\operatorname{list}. In Python, the operation of verifying whether a specific example 𝐱\mathbf{x} belongs to 𝒮\mathcal{S} is efficient when 𝒮\mathcal{S} is declared as a set\operatorname{set} and is inefficient when 𝒮\mathcal{S} is declared as a list\operatorname{list}.
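A quick illustration (on average, membership tests on a set take constant time via hashing, while on a list they scan the elements one by one):

```python
# Membership tests: O(1) on average for a set, O(N) for a list.
S_list = list(range(1_000_000))
S_set = set(S_list)

x = 999_999
in_list = x in S_list   # scans the list element by element
in_set = x in S_set     # a single hash lookup
```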

Another important data structure that you can use to make your Python code more efficient is dict\operatorname{dict}. It is called a dictionary or a hashmap in other languages. It allows you to define a collection of key-value pairs with very fast lookups for keys.

Unless you know exactly what you do, always prefer using popular libraries to writing your own scientific code. Scientific Python packages like numpy, scipy, and scikit-learn were built by experienced scientists and engineers with efficiency in mind. They have many methods implemented in the C programming language for maximum efficiency.

If you need to iterate over a vast collection of elements, use generators that create a function that returns one element at a time rather than all the elements at once.
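A minimal sketch of a generator:

```python
def squares(n):
    """Yield squares one at a time instead of building them all in memory."""
    for i in range(n):
        yield i * i         # produced lazily, on demand

total = sum(squares(1000))  # consumes the generator one element at a time
```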

Use the cProfile package in Python to find inefficiencies in your code.

Finally, when nothing can be improved in your code from the algorithmic perspective, you can further boost the speed of your code by using:

  • multiprocessing package to run computations in parallel, and

  • PyPy, Numba or similar tools to compile your Python code into fast, optimized machine code.

9 Unsupervised Learning

Unsupervised learning deals with problems in which data doesn’t have labels. That property makes it very problematic for many applications. The absence of labels representing the desired behavior for your model means the absence of a solid reference point to judge the quality of your model. In this book, I only present unsupervised learning methods that allow the building of models that can be evaluated based on data as opposed to human judgment.

9.1 Density Estimation

Density estimation is a problem of modeling the probability density function (pdf) of the unknown probability distribution from which the dataset has been drawn. It can be useful for many applications, in particular for novelty or intrusion detection. In Chapter 7, we already estimated the pdf to solve the one-class classification problem. To do that, we decided that our model would be parametric, more precisely a multivariate normal distribution (MVN). This decision was somewhat arbitrary because if the real distribution from which our dataset was drawn is different from the MVN, our model will be very likely far from perfect. We also know that models can be nonparametric. We used a nonparametric model in kernel regression. It turns out that the same approach can work for density estimation.

{X}=1\{x_i\}_{i=1}^N是一个一维数据集(多维情况类似),其示例是从具有未知 pdf 的分布中抽取的FfXεx_i \in \mathbb {R}对全部=1,……,i=1,\dots,N。我们对建模的形状感兴趣Ff。我们的内核模型Ff,表示为F̂\hat{f}_b, 是(谁)给的,

Let {xi}i=1N\{x_i\}_{i=1}^N be a one-dimensional dataset (a multi-dimensional case is similar) whose examples were drawn from a distribution with an unknown pdf ff with xix_i \in \mathbb {R} for all i=1,,Ni=1,\dots,N. We are interested in modeling the shape of ff. Our kernel model of ff, denoted as f̂b\hat{f}_b, is given by,

f̂b(x)=1Nbi=1Nk(xxib),(24) \hat{f}_b(x)= \frac{1}{Nb}\sum_{i=1}^{N}k\left(\frac {x-x_i}{b}\right), \qquad(24)

where bb is a hyperparameter that controls the tradeoff between bias and variance of our model and kk is a kernel. Again, like in Chapter 7, we use a Gaussian kernel:

k(z)=12πexp(z22). k(z) =\frac{1}{\sqrt{2\pi}}\exp{\left(\frac{-z^2}{2}\right)}.
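Equation 24 with this Gaussian kernel can be sketched in numpy as follows (the one-dimensional dataset below is synthetic):

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-z ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def kde(x, data, b):
    """f_hat_b(x) from eq. 24: an average of kernels centered at the examples."""
    return gaussian_kernel((x - data) / b).sum() / (len(data) * b)

rng = np.random.default_rng(0)
data = rng.normal(size=100)      # synthetic one-dimensional sample

density_at_zero = kde(0.0, data, b=0.5)

# Sanity check: a pdf estimate should integrate to (approximately) one.
xs = np.linspace(-6.0, 6.0, 1201)
approx_integral = sum(kde(x, data, 0.5) for x in xs) * (xs[1] - xs[0])
```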

We look for such a value of bb that minimizes the difference between the real shape of ff and the shape of our model f̂b\hat{f}_b. A reasonable choice of measure of this difference is called the mean integrated squared error (MISE):

MISE(b)=𝔼[(f̂b(x)f(x))2dx].(25) \operatorname{MISE}(b)=\mathbb{E}\left[\,\int_{\mathbb {R}} ({\hat{f}}_b(x)-f(x))^{2}\,dx\right].\qquad(25)

Intuitively, you see in eq. 25 that we square the difference between the real pdf ff and our model of it f̂b\hat{f}_b. The integral \int_{\mathbb {R}} replaces the summation i=1N\sum_{i=1}^N we employed in the mean squared error, while the expectation operator 𝔼\mathbb{E} replaces the average 1N\frac{1}{N}.

Indeed, when our loss is a function with a continuous domain, such as (f̂b(x)f(x))2({\hat{f}}_b(x)-f(x))^{2}, we have to replace the summation with the integral. The expectation operation 𝔼\mathbb{E} means that we want bb to be optimal for all possible realizations of our training set {xi}i=1N\{x_i\}_{i=1}^N. That is important because f̂b{\hat{f}}_b is defined on a finite sample of some probability distribution, while the real pdf ff is defined on an infinite domain (the set \mathbb{R}).

Now, we can rewrite the right-hand side term in eq. 25 like this:

𝔼[f̂b2(x)dx]2𝔼[f̂b(x)f(x)dx]+𝔼[f(x)2dx]. \begin{split}\mathbb{E}\left[ \int_{\mathbb{R}}\hat{f}^2_b(x)dx\right] \\ - 2 \mathbb{E}\left[\int_{\mathbb{R}}\hat{f}_b(x)f(x)dx\right] \\ + \mathbb{E}\left[\int_{\mathbb{R}} f(x)^2 dx\right].\end{split}

The third term in the above summation is independent of bb and thus can be ignored. An unbiased estimator of the first term is given by f̂b2(x)dx\int_{\mathbb{R}} \hat{f}_b^2(x)dx while the unbiased estimator of the second term can be approximated by cross-validation 2Ni=1Nf̂b(i)(xi)-\frac{2}{N}\sum_{i=1}^N \hat{f}^{(i)}_b(x_i), where f̂b(i)\hat{f}^{(i)}_b is a kernel model of ff computed on our training set with the example xix_i excluded.

The term i=1Nf̂b(i)(xi)\sum_{i=1}^N \hat{f}^{(i)}_b(x_i) is known in statistics as the leave one out estimate, a form of cross-validation in which each fold consists of one example. You could have noticed that the term f̂b(x)f(x)dx\int_{\mathbb{R}}\hat{f}_b(x)f(x)dx (let’s call it aa) is the expected value of the function f̂b\hat{f}_b, because ff is a pdf. It can be demonstrated that the leave one out estimate is an unbiased estimator of 𝔼[a]\mathbb{E}\left[a\right].

Now, to find the optimal value b*b^* for bb, we minimize the cost defined as,

f̂b2(x)dx2Ni=1Nf̂b(i)(xi). \int_{\mathbb{R}} \hat{f}_b^2(x)dx - \frac{2}{N}\sum_{i=1}^N \hat{f}^{(i)}_b(x_i).

We can find b*b^* using grid search. For DD-dimensional feature vectors 𝐱\mathbf{x}, the error term xxix-x_i in eq. 24 can be replaced by the Euclidean distance 𝐱𝐱i\|\mathbf{x}-\mathbf{x}_i\|. In fig. 42-fig. 44 you can see the estimates for the same pdf obtained with three different values of bb from a 100100-example dataset.
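A sketch of this grid search, with the integral in the first term of the cost approximated numerically and the second term computed by leave-one-out (the data is synthetic and the candidate grid for b is arbitrary):

```python
import numpy as np

def k(z):
    return np.exp(-z ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def f_hat(x, data, b):
    """Kernel density estimate of eq. 24 at a single point x."""
    return k((x - data) / b).sum() / (len(data) * b)

def cost(data, b, grid_x):
    """Estimate of MISE(b) up to a b-independent constant:
    the integral of f_hat_b^2 minus twice the leave-one-out average."""
    # First term: numerical integral of f_hat_b(x)^2 over a fine grid.
    dx = grid_x[1] - grid_x[0]
    first = sum(f_hat(x, data, b) ** 2 for x in grid_x) * dx
    # Second term: leave-one-out estimate, excluding x_i from its own model.
    loo = sum(f_hat(data[i], np.delete(data, i), b) for i in range(len(data)))
    return first - 2.0 * loo / len(data)

rng = np.random.default_rng(1)
data = rng.normal(size=100)                   # synthetic sample
grid_x = np.linspace(-6.0, 6.0, 601)

candidates = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]  # arbitrary candidate grid for b
b_star = min(candidates, key=lambda b: cost(data, b, grid_x))
```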

Figure 42: Kernel density estimation: a good fit.
Figure 43: Kernel density estimation: overfitting.
Figure 44: Kernel density estimation: underfitting.

The corresponding grid search curve is shown below:

Figure 45: Kernel density estimation: grid search curve for the best value of b.

We pick b*b^* at the minimum of the grid search curve.

9.2 Clustering

Clustering is a problem of learning to assign a label to examples by leveraging an unlabeled dataset. Because the dataset is completely unlabeled, deciding on whether the learned model is optimal is much more complicated than in supervised learning.

There is a variety of clustering algorithms, and, unfortunately, it’s hard to tell which one is better in quality for your dataset. Usually, the performance of each algorithm depends on the unknown properties of the probability distribution that the dataset was drawn from. In this Chapter, I outline the most useful and widely used clustering algorithms.

9.2.1 K-Means

The k-means clustering algorithm works as follows. First, you choose kk — the number of clusters. Then you randomly put kk feature vectors, called centroids, to the feature space.

We then compute the distance from each example 𝐱\mathbf{x} to each centroid 𝐜\mathbf{c} using some metric, like the Euclidean distance. Then we assign the closest centroid to each example (as if we labeled each example with the centroid's id). For each centroid, we calculate the average feature vector of the examples labeled with it. These average feature vectors become the new locations of the centroids.

We recompute the distance from each example to each centroid, modify the assignment and repeat the procedure until the assignments don’t change after the centroid locations were recomputed. The model is the list of assignments of centroids IDs to the examples.
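The steps above can be sketched in numpy (the dataset below is synthetic; real implementations add a smarter initialization such as k-means++):

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """A minimal k-means sketch. X is an (N, D) array of feature vectors."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids at k randomly chosen examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assign each example to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the examples assigned to it
        # (keep the old centroid if its cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stopped changing
        centroids = new_centroids
    return labels, centroids

# Synthetic data: two well-separated blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
labels, centroids = k_means(X, k=2)
```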

The initial positions of the centroids influence the final positions, so two runs of k-means can result in two different models. Some variants of k-means compute the initial positions of centroids based on some properties of the dataset.

One run of the k-means algorithm is illustrated below:

Figure 46: The progress of the k-means algorithm for k = 3.

The circles in the above figure are two-dimensional feature vectors; the squares are moving centroids. Different background colors represent regions in which all points belong to the same cluster.

The value of kk, the number of clusters, is a hyperparameter that has to be tuned by the data analyst. There are some techniques for selecting kk. None of them is proven optimal. Most of those techniques require the analyst to make an “educated guess” by looking at some metrics or by examining cluster assignments visually. In this chapter, I present one approach to choose a reasonably good value for kk without looking at the data and making guesses.

9.2.2 DBSCAN and HDBSCAN

While k-means and similar algorithms are centroid-based, DBSCAN is a density-based clustering algorithm. Instead of guessing how many clusters you need, by using DBSCAN, you define two hyperparameters: ϵ\epsilon and nn. You start by picking an example 𝐱\mathbf{x} from your dataset at random and assign it to cluster 11. Then you count how many examples have the distance from 𝐱\mathbf{x} less than or equal to ϵ\epsilon. If this quantity is greater than or equal to nn, then you put all these ϵ\epsilon-neighbors to the same cluster 11. You then examine each member of cluster 11 and find their respective ϵ\epsilon-neighbors. If some member of cluster 11 has nn or more ϵ\epsilon-neighbors, you expand cluster 11 by adding those ϵ\epsilon-neighbors to the cluster. You continue expanding cluster 11 until there are no more examples to put in it. In the latter case, you pick from the dataset another example not belonging to any cluster and put it to cluster 22. You continue like this until all examples either belong to some cluster or are marked as outliers. An outlier is an example whose ϵ\epsilon-neighborhood contains less than nn examples.
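A compact sketch of this procedure (with a naive O(N²)-per-query neighbor search; real implementations use spatial indexes; the example data is made up):

```python
import numpy as np

def dbscan(X, eps, n_min):
    """A minimal DBSCAN sketch. Returns a label per example; -1 marks outliers."""
    N = len(X)
    labels = np.full(N, -1)            # -1: not yet assigned / outlier
    visited = np.zeros(N, dtype=bool)
    cluster = 0
    for i in range(N):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = [j for j in range(N) if np.linalg.norm(X[i] - X[j]) <= eps]
        if len(neighbors) < n_min:
            continue                    # outlier (may be claimed by a cluster later)
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:                    # expand the cluster
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = [m for m in range(N) if np.linalg.norm(X[j] - X[m]) <= eps]
            if len(j_neighbors) >= n_min:
                queue.extend(j_neighbors)
    return labels

# Two tight groups plus one far-away outlier:
X_demo = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                   [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
                   [10.0, 10.0]])
labels_demo = dbscan(X_demo, eps=0.5, n_min=3)
```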

The advantage of DBSCAN is that it can build clusters that have an arbitrary shape, while k-means and other centroid-based algorithms create clusters that have a shape of a hypersphere. An obvious drawback of DBSCAN is that it has two hyperparameters and choosing good values for them (especially ϵ\epsilon) could be challenging. Furthermore, having ϵ\epsilon fixed, the clustering algorithm cannot effectively deal with clusters of varying density.

HDBSCAN is the clustering algorithm that keeps the advantages of DBSCAN, by removing the need to decide on the value of ϵ\epsilon. The algorithm is capable of building clusters of varying density. HDBSCAN is an ingenious combination of multiple ideas and describing the algorithm in full is beyond the scope of this book.

HDBSCAN only has one important hyperparameter: nn, the minimum number of examples to put in a cluster. This hyperparameter is relatively simple to choose by intuition. HDBSCAN has very fast implementations: it can deal with millions of examples effectively. Modern implementations of k-means are, however, much faster than HDBSCAN, but the advantages of the latter may outweigh its drawbacks for many practical tasks. I recommend always trying HDBSCAN on your data first.

9.2.3 Determining the Number of Clusters

The most important question is how many clusters does your dataset have? When the feature vectors are one-, two- or three-dimensional, you can look at the data and see “clouds” of points in the feature space. Each cloud is a potential cluster. However, for DD-dimensional data, with D>3D > 3, looking at the data is problematic.

One way of determining the reasonable number of clusters is based on the concept of prediction strength. The idea is to split the data into training and test set, similarly to how we do in supervised learning. Once you have the training and test sets, 𝒮tr\mathcal{S}_{tr} of size NtrN_{tr} and 𝒮te\mathcal{S}_{te} of size NteN_{te} respectively, you fix kk, the number of clusters, and run a clustering algorithm CC on sets 𝒮tr\mathcal{S}_{tr} and 𝒮te\mathcal{S}_{te} and obtain the clustering results C(𝒮tr,k)C(\mathcal{S}_{tr},k) and C(𝒮te,k)C(\mathcal{S}_{te},k).

AA是聚类C𝒮tr,kC(\mathcal{S}_{tr},k)使用训练集构建。簇在AA可以看作是区域。如果一个示例属于这些区域之一,则该示例属于某个特定的集群。例如,如果我们将 k-means 算法应用于某个数据集,则会将特征空间划分为kk多边形区域,如图所示。  46 .

Let AA be the clustering C(𝒮tr,k)C(\mathcal{S}_{tr},k) built using the training set. The clusters in AA can be seen as regions. If an example falls within one of those regions, then that example belongs to some specific cluster. For example, if we apply the k-means algorithm to some dataset, it results in a partition of the feature space into kk polygonal regions, as we saw in fig. 46.

Define the Nte×NteN_{te} \times N_{te} co-membership matrix 𝐃[A,𝒮te]\mathbf{D}[A,\mathcal{S}_{te}] as follows: 𝐃[A,𝒮te](i,i)=1\mathbf{D}[A,\mathcal{S}_{te}]^{(i,i')} = 1 if and only if examples 𝐱i\mathbf{x}_i and 𝐱i\mathbf{x}_{i'} from the test set belong to the same cluster according to the clustering AA. Otherwise D[A,𝒮te](i,i)=0D[A,\mathcal{S}_{te}]^{(i,i')} = 0.

Let’s take a break and see what we have here. We have built, using the training set of examples, a clustering AA that has kk clusters. Then we have built the co-membership matrix that indicates whether two examples from the test set belong to the same cluster in AA.

Intuitively, if the quantity kk is the reasonable number of clusters, then two examples that belong to the same cluster in clustering C(𝒮te,k)C(\mathcal{S}_{te},k) will most likely belong to the same cluster in clustering C(𝒮tr,k)C(\mathcal{S}_{tr},k). On the other hand, if kk is not reasonable (too high or too low), then training data-based and test data-based clusterings will likely be less consistent.

Figure 47: The data used for the clustering shown in fig. 48.

Using the data shown in fig. 47, the idea is illustrated below:

Figure 48: Clustering for k = 4: (a) clustering of the training data; (b) clustering of the test data; (c) test data plotted over the training clustering.

The plots in fig. 48a and fig. 48b show respectively C(𝒮tr,4)C(\mathcal{S}_{tr},4) and C(𝒮te,4)C(\mathcal{S}_{te},4) with their respective cluster regions.

Test examples plotted over the training data cluster regions are shown in fig. 48c. You can see in fig. 48c that orange test examples don’t belong anymore to the same cluster according to the clustering regions obtained from the training data. This will result in many zeroes in the matrix 𝐃[A,𝒮te]\mathbf{D}[A,\mathcal{S}_{te}] which, in turn, is an indicator that k=4k = 4 is likely not the best number of clusters.

More formally, the prediction strength for the number of clusters kk is given by,

ps(k)=defminj=1,,k1|Aj|(|Aj|1)i,iAj𝐃[A,𝒮te](i,i), \begin{split}\operatorname{ps}(k) \stackrel{\text{def}}{=} \min_{j=1,\ldots,k} \frac{1}{|A_j|(|A_j| - 1)} \\ \cdot\sum_{i,i' \in A_j} \mathbf{D}[A,\mathcal{S}_{te}]^{(i,i')},\end{split}

where A=defC(𝒮tr,k)A \stackrel{\text{def}}{=} C(\mathcal{S}_{tr},k), AjA_j is jthj^{\textrm{th}} cluster from the clustering C(𝒮te,k)C(\mathcal{S}_{te},k) and |Aj||A_j| is the number of examples in cluster AjA_j.

Given a clustering C(𝒮tr,k)C(\mathcal{S}_{tr},k), for each test cluster, we compute the proportion of observation pairs in that cluster that are also assigned to the same cluster by the training set centroids. The prediction strength is the minimum of this quantity over the kk test clusters.
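A sketch of this computation for centroid-based clusterings, where each clustering is represented by its centroids and membership means assignment to the closest centroid (the centroids and test examples below are made up):

```python
import numpy as np

def assign(X, centroids):
    """Label each example with the id of its closest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def prediction_strength(train_centroids, test_centroids, X_te):
    """ps(k) for centroid-based clusterings C(S_tr, k) and C(S_te, k)."""
    a_tr = assign(X_te, train_centroids)   # test examples placed in A's regions
    a_te = assign(X_te, test_centroids)    # clusters A_j of C(S_te, k)
    # Co-membership matrix D[A, S_te]: 1 iff two test examples share a cluster in A.
    D = (a_tr[:, None] == a_tr[None, :]).astype(int)
    strengths = []
    for j in range(len(test_centroids)):
        idx = np.where(a_te == j)[0]
        n_j = len(idx)
        if n_j < 2:
            continue  # a singleton test cluster contributes no pairs
        pairs = D[np.ix_(idx, idx)].sum() - n_j  # drop the i == i' diagonal
        strengths.append(pairs / (n_j * (n_j - 1)))
    return min(strengths)

# Made-up centroids and test examples forming two clear groups:
train_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
test_centroids = np.array([[0.1, 0.0], [5.0, 4.9]])
X_te = np.array([[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0],
                 [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]])
ps = prediction_strength(train_centroids, test_centroids, X_te)
```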

Experiments suggest that a reasonable number of clusters is the largest kk such that ps(k)\operatorname{ps}(k) is above 0.80.8. In fig. 49, you can see examples of predictive strength for different values of kk for two-, three- and four-cluster data.

Figure 49: Prediction strength for different values of kk for two-, three- and four-cluster data.

For non-deterministic clustering algorithms, such as k-means, which can generate different clusterings depending on the initial positions of centroids, it is recommended to do multiple runs of the clustering algorithm for the same kk and compute the average prediction strength ps(k)\bar{\operatorname{ps}}(k) over multiple runs.
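To make the procedure concrete, here is a minimal sketch of the prediction strength computation in Python. It uses scikit-learn's KMeans as the clustering algorithm CC (an assumption for illustration; any clustering algorithm works) and synthetic blob data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

def prediction_strength(X_tr, X_te, k, random_state=0):
    """Prediction strength ps(k) as defined above (a sketch)."""
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_tr)
    km_te = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_te)
    # Assign the test examples to the *training* centroids.
    tr_labels_of_te = km_tr.predict(X_te)
    te_labels = km_te.labels_
    ps = []
    for j in range(k):
        idx = np.where(te_labels == j)[0]  # test cluster A_j
        n_j = len(idx)
        if n_j < 2:
            continue
        # D[A, S_te]^(i,i') = 1 iff i and i' fall in the same training-centroid cluster
        same = tr_labels_of_te[idx][:, None] == tr_labels_of_te[idx][None, :]
        np.fill_diagonal(same, False)
        ps.append(same.sum() / (n_j * (n_j - 1)))
    return min(ps)

# Synthetic data with three well-separated clusters (an assumption)
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.5, random_state=1)
X_tr, X_te = train_test_split(X, test_size=0.5, random_state=1)
for k in (2, 3, 4):
    print(k, round(prediction_strength(X_tr, X_te, k), 3))
```

On such well-separated data the value at the true number of clusters is close to one, while overly large kk splits a blob differently in the two halves and drives the minimum down.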

Another effective method to estimate the number of clusters is the gap statistic method. Other, less automatic methods, which some analysts still use, include the elbow method and the average silhouette method.

9.2.4 Other Clustering Algorithms

DBSCAN and k-means compute so-called hard clustering, in which each example can belong to only one cluster. Gaussian mixture model (GMM) allows each example to be a member of several clusters with different membership score (HDBSCAN also allows this). Computing a GMM is very similar to doing model-based density estimation. In GMM, instead of having just one multivariate normal distribution (MND), we have a weighted sum of several MNDs:

fX=j=1kϕjf𝛍j,𝚺j, f_X = \sum_{j = 1}^k \phi_j f_{\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j},

where f𝛍j,𝚺jf_{\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j} is the jthj^{\textrm{th}} MND, and ϕj\phi_j is its weight in the sum. The values of the parameters 𝛍j\boldsymbol{\mu}_j, 𝚺j\boldsymbol{\Sigma}_j, and ϕj\phi_j, for all j=1,,kj = 1, \ldots, k, are obtained using the expectation maximization algorithm (EM) to optimize the maximum likelihood criterion.

Again, for simplicity, let us look at the one-dimensional data. Also assume that there are two clusters: k=2k = 2. In this case, we have two Gaussian distributions,

f(xμ1,σ12)=12πσ12exp(xμ1)22σ12(26) \begin{split} f(x \mid \mu_1 ,\sigma_1^2)=\frac{1}{\sqrt{2\pi\sigma_1^2}}\\ \cdot\exp{-{\frac{(x-\mu_1)^{2}}{2\sigma_1^2}}}\end{split} \qquad(26)

and

f(xμ2,σ22)=12πσ22exp(xμ2)22σ22,(27) \begin{split} f(x \mid \mu_2 ,\sigma_2^2)=\frac{1}{\sqrt{2\pi\sigma_2^2}}\\ \cdot\exp{-{\frac{(x-\mu_2)^{2}}{2\sigma_2^2}}},\end{split} \qquad(27)

where f(xμ1,σ12)f(x \mid \mu_1 ,\sigma_1^2) and f(xμ2,σ22)f(x \mid \mu_2 ,\sigma_2^2) are two pdfs defining the likelihood of X=xX = x.

We use the EM algorithm to estimate μ1\mu_1, σ12\sigma_1^2, μ2\mu_2, σ22\sigma_2^2, ϕ1\phi_1, and ϕ2\phi_2. The parameters ϕ1\phi_1 and ϕ2\phi_2 are useful for the density estimation and less useful for clustering, as we will see below.

EM works as follows. In the beginning, we guess the initial values for μ1\mu_1, σ12\sigma_1^2, μ2\mu_2, and σ22\sigma_2^2, and set ϕ1=ϕ2=12\phi_1 = \phi_2 = \frac{1}{2} (in general, it’s 1k\frac{1}{k} for each ϕj\phi_j, j1,,kj \in {1,\ldots,k}).

At each iteration of EM, the following four steps are executed:

  1. For all i=1,,Ni = 1, \ldots, N, calculate the likelihood of each xix_i using eq. 26 and eq. 27:

f(xiμ1,σ12)12πσ12exp(xiμ1)22σ12 f(x_i \mid \mu_1 ,\sigma_1^2)\gets\frac{1}{\sqrt{2\pi\sigma_1^2}}\exp{-{\frac{(x_i-\mu_1)^{2}}{2\sigma_1^2}}}

and

f(xiμ2,σ22)12πσ22exp(xiμ2)22σ22. f(x_i \mid \mu_2 ,\sigma_2^2)\gets\frac{1}{\sqrt{2\pi\sigma_2^2}}\exp{-{\frac{(x_i-\mu_2)^{2}}{2\sigma_2^2}}}.

  2. Using Bayes' Rule, for each example xix_i, calculate the likelihood bi(j)b^{(j)}_i that the example belongs to cluster j{1,2}j \in \{1,2\} (in other words, the likelihood that the example was drawn from the Gaussian jj):

bi(j)f(xiμj,σj2)ϕjf(xiμ1,σ12)ϕ1+f(xiμ2,σ22)ϕ2. b^{(j)}_i \leftarrow \frac{f(x_i \mid \mu_j ,\sigma_j^2)\phi_j}{f(x_i \mid \mu_1 ,\sigma_1^2)\phi_1 + f(x_i \mid \mu_2 ,\sigma_2^2)\phi_2}.

The parameter ϕj\phi_j reflects how likely it is that our Gaussian distribution jj with parameters μj\mu_j and σj2\sigma_j^2 produced our dataset. That is why in the beginning we set ϕ1=ϕ2=12\phi_1 = \phi_2 = \frac{1}{2}: we don't know how likely each of the two Gaussians is, and we reflect our ignorance by setting both likelihoods to one half.

  3. Compute the new values of μj\mu_j and σj2\sigma_j^2, j{1,2}j \in \{1,2\} as,

μji=1Nbi(j)xii=1Nbi(j)(28) \mu_j \leftarrow \frac{\sum_{i=1}^N b^{(j)}_i x_i}{\sum_{i=1}^N b^{(j)}_i} \qquad(28)

and

σj2i=1Nbi(j)(xiμj)2i=1Nbi(j).(29) \sigma_j^2 \leftarrow \frac{\sum_{i=1}^N b^{(j)}_i (x_i - \mu_j)^2}{\sum_{i=1}^N b^{(j)}_i}. \qquad(29)

  4. Update ϕj\phi_j, j{1,2}j \in \{1,2\} as,

ϕj1Ni=1Nbi(j). \phi_j \leftarrow \frac{1}{N} \sum_{i=1}^N b^{(j)}_i.

The steps 141-4 are executed iteratively until the values μj\mu_j and σj2\sigma_j^2 don't change much: for example, the change is below some threshold ϵ\epsilon. The process is illustrated in fig. 50 below.

Figure 50: The progress of Gaussian mixture model estimation for two clusters (k=2k = 2) using the EM algorithm.

You may have noticed that the EM algorithm is very similar to the k-means algorithm: start with random clusters, then iteratively update each cluster’s parameters by averaging the data that is assigned to that cluster. The only difference in the case of GMM is that the assignment of an example xix_i to the cluster jj is soft: xix_i belongs to cluster jj with probability bi(j)b^{(j)}_i. This is why we calculate the new values for μj\mu_j and σj2\sigma_j^2 in eq. 28 and eq. 29 not as an average (used in k-means) but as a weighted average with weights bi(j)b^{(j)}_i.
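The four EM steps above translate almost line-for-line into NumPy. The sketch below runs them on synthetic one-dimensional data; the true means, variances, and mixture weights are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians (true parameters are an assumption)
x = np.concatenate([rng.normal(-4.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])
N, k = len(x), 2

def pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu = np.array([-1.0, 1.0])   # initial guesses
var = np.array([1.0, 1.0])
phi = np.full(k, 1.0 / k)    # phi_1 = phi_2 = 1/2

for _ in range(300):
    # Step 1: likelihoods from eq. 26 and eq. 27
    lik = np.stack([pdf(x, mu[j], var[j]) for j in range(k)])  # shape (k, N)
    # Step 2: soft assignments b_i^(j) via Bayes' Rule
    b = lik * phi[:, None]
    b /= b.sum(axis=0, keepdims=True)
    # Step 3: weighted-average updates, eq. 28 and eq. 29
    mu_new = (b @ x) / b.sum(axis=1)
    var = (b * (x[None, :] - mu_new[:, None]) ** 2).sum(axis=1) / b.sum(axis=1)
    # Step 4: update the mixture weights
    phi = b.mean(axis=1)
    converged = np.abs(mu_new - mu).max() < 1e-8
    mu = mu_new
    if converged:
        break

print(np.sort(mu).round(2), np.sort(phi).round(2))
```

The estimated means converge near the true values -4 and 3, and the weights near the 300/200 mixing proportions.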

Once we have learned the parameters μj\mu_j and σj2\sigma_j^2 for each cluster jj, the membership score of example xx in cluster jj is given by f(xμj,σj2)f(x \mid \mu_j ,\sigma_j^2).

The extension to DD-dimensional data (D>1D > 1) is straightforward. The only difference is that instead of the variance σ2\sigma^2, we now have the covariance matrix 𝚺\mathbf{\Sigma} that parametrizes the multivariate normal distribution (MND).

Contrary to k-means where clusters can only be circular, the clusters in GMM have the form of an ellipse that can have an arbitrary elongation and rotation. The values in the covariance matrix control these properties.

There’s no universally recognized method to choose the right kk in GMM. I recommend that you first split the dataset into training and test set. Then you try different kk and build a different model ftrkf^k_{tr} for each kk on the training data. You pick the value of kk that maximizes the likelihood of examples in the test set:

argmaxki=1|Nte|ftrk(𝐱i), \underset{k}{\arg\max} \prod_{i=1}^{|N_{te}|} f^k_{tr} (\mathbf{x}_i),

where |Nte||N_{te}| is the size of the test set.
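A hedged sketch of this model-selection procedure, using scikit-learn's GaussianMixture on synthetic three-cluster data (an assumption for illustration); its `score` method returns the average per-example log-likelihood on the test set, so maximizing it is equivalent to maximizing the product above:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Synthetic data with three true clusters (assumed for illustration)
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.7, random_state=2)
X_tr, X_te = train_test_split(X, test_size=0.3, random_state=2)

# Fit f^k_tr on the training data for each k, evaluate it on the test data.
scores = {k: GaussianMixture(n_components=k, random_state=0).fit(X_tr).score(X_te)
          for k in range(1, 7)}
best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 2) for k, s in scores.items()})
```

On well-separated data the held-out log-likelihood jumps sharply once kk reaches the true number of components and flattens afterward.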

There is a variety of clustering algorithms described in the literature. Worth mentioning are spectral clustering and hierarchical clustering. For some datasets, you may find those more appropriate. However, in most practical cases, k-means, HDBSCAN and the Gaussian mixture model would satisfy your needs.

9.3 Dimensionality Reduction

Modern machine learning algorithms, such as ensemble algorithms and neural networks, handle very high-dimensional examples well, with up to millions of features. With modern computers and graphical processing units (GPUs), dimensionality reduction techniques are used less in practice than in the past. The most frequent use case for dimensionality reduction is data visualization: humans can only interpret a maximum of three dimensions on a plot.

Another situation in which you could benefit from dimensionality reduction is when you have to build an interpretable model and are therefore limited in your choice of learning algorithms, for example, to decision tree learning or linear regression. By reducing your data to lower dimensionality and by figuring out which quality of the original example each new feature in the reduced feature space reflects, you can use simpler algorithms. Dimensionality reduction removes redundant or highly correlated features; it also reduces the noise in the data. All of that contributes to the interpretability of the model.

Three widely used techniques of dimensionality reduction are principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and autoencoders.

I already explained autoencoders in Chapter 7. You can use the low-dimensional output of the bottleneck layer of the autoencoder as the vector of reduced dimensionality that represents the high-dimensional input feature vector. You know that this low-dimensional vector represents the essential information contained in the input vector because the autoencoder is capable of reconstructing the input feature vector based on the bottleneck layer output alone.

9.3.1 Principal Component Analysis

Principal component analysis, or PCA, is one of the oldest dimensionality reduction methods. The math behind it involves operations on matrices that I didn't explain in Chapter 2, so I leave the math of PCA for your further reading. Here, I only provide the intuition and illustrate the method with an example.

Consider a two-dimensional dataset as shown in fig. 51a.

Figure 51: PCA: (a) the original data; (b) two principal components displayed as vectors; (c) the data projected on the first principal component.

Principal components are vectors that define a new coordinate system in which the first axis goes in the direction of the highest variance in the data. The second axis is orthogonal to the first one and goes in the direction of the second highest variance in the data. If our data was three-dimensional, the third axis would be orthogonal to both the first and the second axes and go in the direction of the third highest variance, and so on. In fig. 51b, the two principal components are shown as arrows. The length of the arrow reflects the variance in this direction.

Now, if we want to reduce the dimensionality of our data to Dnew<DD_{new} < D, we pick DnewD_{new} largest principal components and project our data points on them. For our two-dimensional illustration, we can set Dnew=1D_{new} = 1 and project our examples to the first principal component to obtain the orange points in fig. 51c.

To describe each orange point, we need only one coordinate instead of two: the coordinate with respect to the first principal component. When our data is very high-dimensional, it often happens in practice that the first two or three principal components account for most of the variation in the data, so by displaying the data on a 2D or 3D plot we can indeed see very high-dimensional data and its properties.
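The projection of fig. 51c can be reproduced in a few lines with scikit-learn's PCA. The correlated two-dimensional data below is synthetic and only stands in for the data of the figure:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated two-dimensional data, as in fig. 51a (assumed for illustration)
x1 = rng.normal(0.0, 1.0, 200)
X = np.column_stack([x1, 0.8 * x1 + rng.normal(0.0, 0.3, 200)])

pca = PCA(n_components=1)             # keep D_new = 1 principal component
Z = pca.fit_transform(X)              # one coordinate per example (fig. 51c)
print(Z.shape)                        # (200, 1)
print(pca.explained_variance_ratio_)  # share of the variance the component keeps
```

Because the two features are strongly correlated, the single retained component accounts for most of the variance, which is exactly the situation in which PCA loses little information.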

9.3.2 UMAP

The idea behind many of the modern dimensionality reduction algorithms, especially those designed specifically for visualization purposes such as t-SNE and UMAP, is basically the same. We first design a similarity metric for two examples. For visualization purposes, besides the Euclidean distance between the two examples, this similarity metric often reflects some local properties of the two examples, such as the density of other examples around them.

In UMAP, this similarity metric ww is defined as follows,

w(𝐱i,𝐱j)=defwi(𝐱i,𝐱j)+wj(𝐱j,𝐱i)wi(𝐱i,𝐱j)wj(𝐱j,𝐱i).(30) \begin{split}w(\mathbf{x}_i,\mathbf{x}_j) \stackrel{\text{def}}{=} w_i(\mathbf{x}_i,\mathbf{x}_j) + w_j(\mathbf{x}_j,\mathbf{x}_i) \\ - w_i(\mathbf{x}_i,\mathbf{x}_j)w_j(\mathbf{x}_j,\mathbf{x}_i).\end{split} \qquad(30)

The function wi(𝐱i,𝐱j)w_i(\mathbf{x}_i,\mathbf{x}_j) is defined as,

wi(𝐱i,𝐱j)=defexp(d(𝐱i,𝐱j)ρiσi), w_i(\mathbf{x}_i,\mathbf{x}_j) \stackrel{\text{def}}{=} \exp\left(-\frac{d(\mathbf{x}_i,\mathbf{x}_j)-\rho_i}{\sigma_i}\right),

where d(𝐱i,𝐱j)d(\mathbf{x}_i,\mathbf{x}_j) is the Euclidean distance between two examples, ρi\rho_i is the distance from 𝐱i\mathbf{x}_i to its closest neighbor, and σi\sigma_i is the distance from 𝐱i\mathbf{x}_i to its kthk^{\textrm{th}} closest neighbor (kk is a hyperparameter of the algorithm).

It can be shown that the metric in eq. 30 varies in the range from 00 to 11 and is symmetric, which means that w(𝐱i,𝐱j)=w(𝐱j,𝐱i)w(\mathbf{x}_i,\mathbf{x}_j) = w(\mathbf{x}_j, \mathbf{x}_i).
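The similarity of eq. 30 is easy to compute directly. The sketch below uses the text's definitions of ρi\rho_i and σi\sigma_i literally (in the actual UMAP algorithm σi\sigma_i is found by a search, so treat this as a simplified illustration) and checks the symmetry and range properties just stated:

```python
import numpy as np

def umap_similarity(X, k=3):
    """Pairwise similarities w(x_i, x_j) from eq. 30 (a simplified sketch)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)      # ignore self-distances
    sorted_d = np.sort(d, axis=1)
    rho = sorted_d[:, 0]             # distance to the closest neighbor
    sigma = sorted_d[:, k - 1]       # distance to the k-th closest neighbor
    w_dir = np.exp(-(d - rho[:, None]) / sigma[:, None])  # w_i(x_i, x_j)
    w = w_dir + w_dir.T - w_dir * w_dir.T                 # eq. 30
    return w

X = np.random.default_rng(0).normal(size=(20, 2))
w = umap_similarity(X)
print(np.allclose(w, w.T), w.min(), w.max())
```

The matrix is symmetric by construction, and every entry stays in the [0,1][0, 1] range, as claimed above.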

ww表示原始高维空间中两个示例的相似度,并让ww'是由相同方程给出的相似度。  30在新的低维空间中。

Let ww denote the similarity of two examples in the original high-dimensional space and let ww' be the similarity given by the same eq. 30 in the new low-dimensional space.

To continue, I need to quickly introduce the notion of a fuzzy set. A fuzzy set is a generalization of a set. For each element xx in a fuzzy set 𝒮\mathcal{S}, there’s a membership function μ𝒮(x)[0,1]\mu_{\mathcal{S}}(x) \in [0,1] that defines the membership strength of xx to the set 𝒮\mathcal{S}. We say that xx weakly belongs to a fuzzy set 𝒮\mathcal{S} if μ𝒮(x)\mu_{\mathcal{S}}(x) is close to zero. On the other hand, if μ𝒮(x)\mu_{\mathcal{S}}(x) is close to 11, then xx has a strong membership in 𝒮\mathcal{S}. If μ(x)=1\mu(x) = 1 for all x𝒮x \in \mathcal{S}, then a fuzzy set 𝒮\mathcal{S} becomes equivalent to a normal, nonfuzzy set.

Let’s now see why we need this notion of a fuzzy set here.

Because the values of ww and ww' lie in the range between 00 and 11, we can see w(𝐱i,𝐱j)w(\mathbf{x}_i, \mathbf{x}_j) as membership of the pair of examples (𝐱i,𝐱j)(\mathbf{x}_i, \mathbf{x}_j) in a certain fuzzy set. The same can be said about ww'. The notion of similarity of two fuzzy sets is called fuzzy set cross-entropy and is defined as,

Cw,w=i=1Nj=1N[w(𝐱i,𝐱j)ln(w(𝐱i,𝐱j)w(𝐱i,𝐱j))+(1w(𝐱i,𝐱j))ln(1w(𝐱i,𝐱j)1w(𝐱i,𝐱j))],(31) \begin{split} C_{w,w'} = \sum_{i=1}^N\sum_{j=1}^N\Bigg[ w(\mathbf{x}_i,\mathbf{x}_j)\\ \cdot\ln\left(\frac{w(\mathbf{x}_i,\mathbf{x}_j)}{w'(\mathbf{x}'_i,\mathbf{x}'_j)}\right) + (1-w(\mathbf{x}_i,\mathbf{x}_j))\\ \cdot\ln \left(\frac{1-w(\mathbf{x}_i,\mathbf{x}_j)}{1-w'(\mathbf{x}'_i,\mathbf{x}'_j)}\right)\Bigg],\end{split} \qquad(31)

where 𝐱\mathbf{x}' is the low-dimensional “version” of the original high-dimensional example 𝐱\mathbf{x}.

In eq. 31 the unknown parameters are 𝐱i\mathbf{x}'_i (for all i=1,,Ni = 1,\ldots,N), the low-dimensional examples we look for. We can compute them by gradient descent by minimizing Cw,wC_{w,w'}.

In fig. 52-fig. 54, you can see the result of dimensionality reduction applied to the MNIST dataset of handwritten digits.

Figure 52: Dimensionality reduction of the MNIST dataset using PCA.
Figure 53: Dimensionality reduction of the MNIST dataset using UMAP.
Figure 54: Dimensionality reduction of the MNIST dataset using an autoencoder.

MNIST is commonly used for benchmarking various image processing systems; it contains 70,000 labeled examples. The ten different colors on the plot correspond to the ten classes. Each point on the plot corresponds to a specific example in the dataset. As you can see, UMAP separates the examples visually better (remember, it doesn't have access to the labels). In practice, UMAP is slightly slower than PCA but faster than an autoencoder.

9.4 Outlier Detection

Outlier detection is the problem of detecting the examples in the dataset that are very different from what a typical example in the dataset looks like. We have already seen several techniques that could help solve this problem: autoencoders and one-class classifier learning. If we use an autoencoder, we train it on our dataset. Then, if we want to predict whether an example is an outlier, we use the autoencoder model to reconstruct the example from the bottleneck layer. The model is unlikely to be capable of reconstructing an outlier well.
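The reconstruction-error idea can be sketched without a neural network by using PCA as a stand-in for the autoencoder: both compress an example to a low-dimensional representation and reconstruct it, and an example that doesn't fit the learned structure reconstructs poorly. The data below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Inliers lie near a one-dimensional subspace; the last point does not.
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t]) + rng.normal(0.0, 0.05, (200, 2))
X = np.vstack([X, [[3.0, -3.0]]])    # an outlier off the subspace

pca = PCA(n_components=1).fit(X)     # a "bottleneck" of size 1
X_rec = pca.inverse_transform(pca.transform(X))
err = np.linalg.norm(X - X_rec, axis=1)  # reconstruction error per example

print(int(np.argmax(err)))           # index 200: the outlier
```

Inliers reconstruct almost perfectly, while the outlier's reconstruction error is larger by orders of magnitude; thresholding this error gives a simple outlier detector.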

In one-class classification, the model either predicts that the input example belongs to the class or that it is an outlier.


  1. Some analysts look at multiple two-dimensional plots, in which only a pair of features is present at a time. This might give an intuition about the number of clusters. However, such an approach suffers from subjectivity, is prone to error, and counts as an educated guess rather than a scientific method.

10 Other Forms of Learning

10.1 Metric Learning

I mentioned that the most frequently used metrics of similarity (or dissimilarity) between two feature vectors are Euclidean distance and cosine similarity. Such choices of metric seem logical but arbitrary, just like the choice of the squared error in linear regression (or the form of linear regression itself). The fact that one metric can work better than another depending on the dataset is an indicator that none of them are perfect.

You can create a metric that would work better for your dataset. It’s then possible to integrate your metric into any learning algorithm that needs a metric, like k-means or kNN. How can you know, without trying all possibilities, which equation would be a good metric? As you could already guess, a metric can be learned from data.

Remember the Euclidean distance between two feature vectors 𝐱\mathbf{x} and 𝐱\mathbf{x}':

d(𝐱,𝐱)=𝐱𝐱=def(𝐱𝐱)2=(𝐱𝐱)(𝐱𝐱). \begin{aligned} d(\mathbf{x},\mathbf{x}') &= \|\mathbf{x} - \mathbf{x}'\| \\ &\stackrel{\text{def}}{=} \sqrt{(\mathbf{x} - \mathbf{x}')^2} \\ &= \sqrt{(\mathbf{x} - \mathbf{x}')(\mathbf{x} - \mathbf{x}')}.\end{aligned}

We can slightly modify this metric to make it parametrizable and then learn these parameters from data. Consider the following modification:

d𝐀(𝐱,𝐱)=𝐱𝐱𝐀=def(𝐱𝐱)𝐀(𝐱𝐱), \begin{aligned} d_{\mathbf{A}}(\mathbf{x},\mathbf{x}') &= \|\mathbf{x}-\mathbf{x}'\|_{\mathbf{A}} \\ &\stackrel{\text{def}}{=} \sqrt{(\mathbf{x} - \mathbf{x}')^{\top}\mathbf{A}(\mathbf{x} - \mathbf{x}')}, \end{aligned}

where 𝐀\mathbf{A} is a D×DD \times D matrix. Let’s say D=3D = 3. If we let 𝐀\mathbf{A} be the identity matrix,

𝐀=def[100010001], \mathbf{A} \stackrel{\text{def}}{=} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},

then d𝐀d_{\mathbf{A}} becomes the Euclidean distance. If we have a general diagonal matrix, like this:

𝐀=def[200080001], \mathbf{A} \stackrel{\text{def}}{=} \begin{bmatrix} 2 & 0 & 0 \\ 0 & 8 & 0 \\ 0 & 0 & 1 \end{bmatrix},

then different dimensions have different importance in the metric. (In the above example, the second dimension is the most important in the metric calculation.) More generally, to be called a metric, a function of two variables has to satisfy three conditions:

1.d(𝐱,𝐱)0nonnegativity,2.d(𝐱,𝐱)d(𝐱,𝐳)+d(𝐳,𝐱)triangle inequality,3.d(𝐱,𝐱)=d(𝐱,𝐱)symmetry. \begin{array}{lll} 1. & d(\mathbf{x},\mathbf{x}')\geq 0 & \mbox{nonnegativity,} \\ 2. & d(\mathbf{x},\mathbf{x}') \leq d(\mathbf{x},\mathbf{z}) + d(\mathbf{z},\mathbf{x'}) & \mbox{triangle inequality,} \\ 3. & d(\mathbf{x},\mathbf{x}')=d(\mathbf{x}',\mathbf{x}) & \mbox{symmetry.} \end{array}

To satisfy the first two conditions, the matrix 𝐀\mathbf{A} has to be positive semidefinite. You can see a positive semidefinite matrix as the generalization of the notion of a nonnegative real number to matrices. Any positive semidefinite matrix 𝐌\mathbf{M} satisfies:

𝐳𝐌𝐳0, \mathbf{z}^{\top}\mathbf{M}\mathbf{z}\geq 0,

for any vector 𝐳\mathbf{z} having the same dimensionality as the number of rows and columns in 𝐌\mathbf{M}.
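A small numerical check of these definitions (the vectors and matrices below are arbitrary examples chosen for illustration):

```python
import numpy as np

def d_A(x, x_prime, A):
    """Parametrized distance ||x - x'||_A from the definition above."""
    diff = x - x_prime
    return np.sqrt(diff @ A @ diff)

x = np.array([1.0, 2.0, 3.0])
xp = np.array([2.0, 0.0, 3.0])

A = np.eye(3)                                   # identity: plain Euclidean distance
print(d_A(x, xp, A), np.linalg.norm(x - xp))    # both equal sqrt(5)

A = np.diag([2.0, 8.0, 1.0])                    # weighs the second dimension most
print(d_A(x, xp, A))                            # sqrt(2*1 + 8*4 + 1*0) = sqrt(34)

# A is positive semidefinite iff all of its eigenvalues are >= 0
print(np.all(np.linalg.eigvalsh(A) >= 0))       # True
```

The eigenvalue check is a practical way to verify the positive-semidefiniteness condition stated above, since z'Mz >= 0 for all z is equivalent to all eigenvalues of a symmetric M being nonnegative.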

The above property follows from the definition of a positive semidefinite matrix. The proof that the second condition is satisfied when the matrix 𝐀\mathbf{A} is positive semidefinite can be found on the book’s companion website.

To satisfy the third condition, we can simply take (d(𝐱,𝐱)+d(𝐱,𝐱))/2(d(\mathbf{x},\mathbf{x}') + d(\mathbf{x}',\mathbf{x}))/2.

Let’s say we have an unannotated set 𝒳={𝐱i}i=1N\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^N. To build the training data for our metric learning problem, we manually create two sets. The first set 𝒮\mathcal{S} is such that a pair of examples (𝐱i,𝐱k)(\mathbf{x}_i, \mathbf{x}_k) belongs to set 𝒮\mathcal{S} if 𝐱i\mathbf{x}_i and 𝐱k\mathbf{x}_k are similar (from our subjective perspective). The second set 𝒟\mathcal{D} is such that a pair of examples (𝐱i,𝐱k)(\mathbf{x}_i, \mathbf{x}_k) belongs to set 𝒟\mathcal{D} if 𝐱i\mathbf{x}_i and 𝐱k\mathbf{x}_k are dissimilar.

To train the matrix of parameters 𝐀\mathbf{A} from the data, we want to find a positive semidefinite matrix 𝐀\mathbf{A} that solves the following optimization problem:

min𝐀(𝐱i,𝐱k)𝒮𝐱i𝐱k𝐀2 \min_{\mathbf{A}} \sum_{(\mathbf{x}_i, \mathbf{x}_k) \in \mathcal{S}} \|\mathbf{x}_i-\mathbf{x}_k\|^2_{\mathbf{A}}

such that:

(𝐱i,𝐱k)𝒟𝐱i𝐱k𝐀c, \sum_{(\mathbf{x}_i, \mathbf{x}_k) \in \mathcal{D}} \|\mathbf{x}_i-\mathbf{x}_k\|_{\mathbf{A}} \geq c,

where cc is a positive constant (can be any number).

The solution to this optimization problem is found by gradient descent with a modification that ensures that the found matrix 𝐀\mathbf{A} is positive semidefinite. We leave the description of the algorithm out of the scope of this book for further reading.

I should point out that one-shot learning with siamese networks and triplet loss can be seen as a metric learning problem: pairs of pictures of the same person belong to the set 𝒮\mathcal{S}, while pairs of random pictures belong to 𝒟\mathcal{D}.

There are many other ways to learn a metric, including non-linear and kernel-based. However, the one presented in this book, as well as the adaptation of one-shot learning, should suffice for most practical applications.

10.2 Learning to Rank

Learning to rank is a supervised learning problem. Among others, one frequent problem solved using learning to rank is the optimization of search results returned by a search engine for a query. In search result ranking optimization, a labeled example 𝒳i\mathcal{X}_i in the training set of size NN is a ranked collection of documents of size rir_i (labels are ranks of documents). A feature vector represents each document in the collection. The goal of the learning is to find a ranking function ff which outputs values that can be used to rank documents. For each training example, an ideal function ff would output values that induce the same ranking of documents as given by the labels.

Each example 𝒳i\mathcal{X}_i, i=1,,Ni=1,\ldots,N, is a collection of feature vectors with labels: 𝒳i={(𝐱i,j,yi,j)}j=1ri\mathcal{X}_i = \{(\mathbf{x}_{i,j}, y_{i,j})\}_{j=1}^{r_i}. Features in a feature vector 𝐱i,j\mathbf{x}_{i,j} represent the document j=1,,rij = 1, \ldots, r_i. For example, xi,j(1)x_{i,j}^{(1)} could represent how recent the document is, xi,j(2)x_{i,j}^{(2)} would reflect whether the words of the query can be found in the document title, xi,j(3)x_{i,j}^{(3)} could represent the size of the document, and so on. The label yi,jy_{i,j} could be the rank (1,2,,ri1, 2, \ldots, r_i) or a score. For example, the lower the score, the higher the document should be ranked.

There are three approaches to solve that problem: pointwise, pairwise, and listwise.

The pointwise approach transforms each training example into multiple examples: one example per document. The learning problem becomes a standard supervised learning problem, either regression or logistic regression. In each example (\mathbf{x}, y) of the pointwise learning problem, \mathbf{x} is the feature vector of some document, and y is the original score (if y_{i,j} is a score) or a synthetic score obtained from the ranking (the higher the rank, the lower the synthetic score). Any supervised learning algorithm can be used in this case. The solution is usually far from perfect. Principally, this is because each document is considered in isolation, while the original ranking (given by the labels y_{i,j} of the original training set) could optimize the positions of the whole set of documents. For example, if we have already given a high rank to a Wikipedia page in some collection of documents, we would prefer not to give a high rank to another Wikipedia page for the same query.
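The pointwise transformation above can be sketched in a few lines. This is a minimal illustration with hypothetical feature vectors and ranks, not a reference implementation; it follows the convention in the text that the higher a document is ranked (rank 1 being the best), the lower its synthetic score.

```python
# Sketch: turning one ranked collection into pointwise training examples.
# Feature vectors and ranks below are hypothetical.
def to_pointwise(docs, ranks):
    """Map each (document, rank) pair to an independent (features, target)
    example; the synthetic score is simply the rank number, so the best
    document (rank 1) gets the lowest score."""
    return [(x, float(r)) for x, r in zip(docs, ranks)]

docs = [[0.2, 1.0], [0.9, 0.0], [0.5, 0.5]]  # one feature vector per document
ranks = [2, 1, 3]                            # labels: ranks of the documents
examples = to_pointwise(docs, ranks)
```

Any regression algorithm can then be trained on the resulting `examples` list, one document at a time.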

In the pairwise approach, we also consider documents in isolation but, in this case, a pair of documents is considered at once. Given a pair of documents (\mathbf{x}_i, \mathbf{x}_k), we build a model f which, given (\mathbf{x}_i, \mathbf{x}_k) as input, outputs a value close to 1 if \mathbf{x}_i should be ranked higher than \mathbf{x}_k; otherwise, f outputs a value close to 0. At test time, the final ranking for an unlabeled example \mathcal{X} is obtained by aggregating the predictions for all pairs of documents in \mathcal{X}. The pairwise approach works better than the pointwise one but is still far from perfect.

The state-of-the-art rank learning algorithms, such as LambdaMART, implement the listwise approach. In the listwise approach, we try to optimize the model directly on some metric that reflects the quality of ranking. There are various metrics for assessing search engine result ranking, including precision and recall. One popular metric that combines both precision and recall is called mean average precision (MAP).

To define MAP, let us ask judges (Google calls those people rankers) to examine a collection of search results for a query and assign a relevancy label to each search result. Labels could be binary (1 for “relevant” and 0 for “irrelevant”) or on some scale, say from 1 to 5: the higher the value, the more relevant the document is to the search query. Let our judges build such relevancy labeling for a collection of 100 queries. Now, let us test our ranking model on this collection. The precision of our model for some query is given by:

\mbox{precision}=\frac{|\{\mbox{rel. docs}\}\cap\{\mbox{ret. docs}\}|}{|\{\mbox{ret. docs}\}|},

where rel. docs stands for “relevant documents,” ret. docs stands for “retrieved documents,” and the notation |\cdot| means “the number of.” The average precision metric, AveP, is defined for a ranked collection of documents returned by a search engine for a query q as,

\operatorname{AveP}(q) = \frac{\sum_{k=1}^n (P(k) \cdot \operatorname{rel}(k))}{|\{\mbox{rel. docs}\}|},

where n is the number of retrieved documents, P(k) denotes the precision computed for the top k search results returned by our ranking model for the query, and \operatorname{rel}(k) is an indicator function equaling 1 if the item at rank k is a relevant document (according to the judges) and zero otherwise. Finally, the MAP for a collection of search queries of size Q is given by,

\operatorname{MAP} = \frac{\sum_{q=1}^Q \operatorname{AveP}(q)}{Q}.
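The two formulas above translate directly into code. The sketch below assumes each query is represented by a list of 0/1 relevance judgments for its ranked results, and that all relevant documents appear in the retrieved list (so the AveP denominator is the number of 1s in that list).

```python
# A minimal sketch of AveP and MAP, assuming binary relevance judgments
# and that all relevant documents are among the retrieved ones.
def average_precision(rel):
    """AveP(q) = sum_k P(k) * rel(k) / |relevant docs|."""
    total_relevant = sum(rel)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            score += hits / k  # P(k): precision at cutoff k
    return score / total_relevant

def mean_average_precision(queries):
    """MAP over a collection of queries, each a 0/1 relevance list."""
    return sum(average_precision(q) for q in queries) / len(queries)
```

For example, `average_precision([1, 0, 1, 0])` sums P(1) = 1 and P(3) = 2/3 over two relevant documents, giving 5/6.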

Now we get back to LambdaMART. This algorithm implements the listwise approach, and it uses gradient boosting to train the ranking function h(\mathbf{x}). Then, the binary model f(\mathbf{x}_i, \mathbf{x}_k) that predicts whether the document \mathbf{x}_i should have a higher rank than the document \mathbf{x}_k (for the same search query) is given by a sigmoid with a hyperparameter \alpha,

f(\mathbf{x}_i, \mathbf{x}_k) \stackrel{\text{def}}{=} \frac{1}{1 +\exp((h(\mathbf{x}_i) - h(\mathbf{x}_k))\alpha)}.
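The sigmoid pairing of two scores is easy to sketch. Here `h` is a hypothetical stand-in for the boosted ensemble of regression trees; with the formula as written, f approaches 1 when h(x_i) is below h(x_k) and equals 0.5 when the two scores tie.

```python
import math

# Sketch of the pairwise model built on top of a scoring function h.
# h below is a placeholder for the gradient-boosted ranking function.
def pairwise_model(h, alpha=1.0):
    """Return f(x_i, x_k) = 1 / (1 + exp((h(x_i) - h(x_k)) * alpha))."""
    def f(x_i, x_k):
        return 1.0 / (1.0 + math.exp((h(x_i) - h(x_k)) * alpha))
    return f

f = pairwise_model(h=lambda x: x, alpha=2.0)  # identity h, for illustration
```

The hyperparameter \alpha controls how sharply the probability saturates as the score gap grows.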

Again, as with many models that predict probability, the cost function is cross-entropy computed using the model f. In gradient boosting, we combine multiple regression trees to build the function h by trying to minimize the cost. Remember that in gradient boosting we add a tree to the model to reduce the error that the current model makes on the training data. For the classification problem, we computed the derivative of the cost function and replaced the real labels of the training examples with those derivatives. LambdaMART works similarly, with one exception: it replaces the real gradient with a combination of the gradient and another factor that depends on the metric, such as MAP. This factor modifies the original gradient by increasing or decreasing it so that the metric value is improved.

That is a very bright idea; not many supervised learning algorithms can boast that they optimize a metric directly. Optimizing a metric is what we really want, but what we do in a typical supervised learning algorithm is optimize the cost instead of the metric (because metrics are usually not differentiable). Usually, in supervised learning, as soon as we have found a model that optimizes the cost function, we try to tweak hyperparameters to improve the value of the metric. LambdaMART optimizes the metric directly.

The remaining question is how we build the ranked list of results based on the predictions of the model f, which only predicts whether its first input should be ranked higher than the second. It’s generally a computationally hard problem, and there are multiple implementations of rankers capable of transforming pairwise comparisons into a ranked list.

The most straightforward approach is to use an existing sorting algorithm. Sorting algorithms sort a collection of numbers in increasing or decreasing order. (The simplest sorting algorithm, bubble sort, is usually taught in engineering schools.) Typically, a sorting algorithm iteratively compares a pair of numbers in the collection and changes their positions in the list based on the result of that comparison. If we plug our function f into a sorting algorithm to execute this comparison, the sorting algorithm will sort documents instead of numbers.
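Plugging f into a standard sort looks like this. The model f here is a hypothetical stand-in that compares document scores (a lower score meaning a higher rank, as earlier in the chapter); a real f would be the trained pairwise model.

```python
from functools import cmp_to_key

# Sketch: using a pairwise model f as the comparator of a sorting algorithm.
def rank_documents(docs, f):
    """Sort documents so that x_i comes before x_k whenever f(x_i, x_k),
    the model's belief that x_i should be ranked higher, exceeds 0.5."""
    def compare(x_i, x_k):
        return -1 if f(x_i, x_k) > 0.5 else 1
    return sorted(docs, key=cmp_to_key(compare))

# Hypothetical pairwise model: f is near 1 when the first document's
# score is lower (i.e., it should be ranked higher).
f = lambda x_i, x_k: 1.0 if x_i < x_k else 0.0
```

With this comparator, Python's built-in sort effectively orders documents instead of numbers.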

10.3 Learning to Recommend

Learning to recommend is an approach to building recommender systems. Usually, we have a user who consumes content. We have the history of consumption and want to suggest new content to this user that they would like. It could be a movie on Netflix or a book on Amazon.

Traditionally, two approaches were used to give recommendations: content-based filtering and collaborative filtering.

Content-based filtering consists of learning what users like based on the description of the content they consume. For example, if the user of a news site often reads articles on science and technology, then we would suggest more documents on science and technology to this user. More generally, we could create one training set per user, adding each news article to this dataset as a feature vector \mathbf{x} and whether the user recently read it as a label y. Then we build a model for each user and can regularly examine each new piece of content to determine whether a specific user would read it or not.

The content-based approach has many limitations. For example, the user can be trapped in the so-called filter bubble: the system will always suggest to that user information that looks very similar to what the user already consumed. That could result in complete isolation of the user from information that disagrees with their viewpoints or expands them. On a more practical side, users might just stop following the recommendations, which is undesirable.

Collaborative filtering has a significant advantage over content-based filtering: the recommendations to one user are computed based on what other users consume or rate. For instance, if two users gave high ratings to the same ten movies, then it’s more likely that user 1 will appreciate new movies recommended based on the tastes of user 2, and vice versa. The drawback of this approach is that the content of the recommended items is ignored.

In collaborative filtering, the information on user preferences is organized in a matrix. Each row corresponds to a user, and each column corresponds to a piece of content that the user rated or consumed. Usually, this matrix is huge and extremely sparse, which means that most of its cells aren’t filled (or are filled with a zero). The reason for such sparsity is that most users consume or rate only a tiny fraction of the available content items. It is very hard to make meaningful recommendations based on such sparse data.

Most real-world recommender systems use a hybrid approach: they combine recommendations obtained by the content-based and collaborative filtering models.

I already mentioned that a content-based recommender model could be built using a classification or regression model that predicts whether a user will like the content based on the content’s features. Examples of features could include the words in books or news articles the user liked, the price, the recency of the content, the identity of the content author and so on.

Two effective recommender system learning algorithms are factorization machines (FM) and denoising autoencoders (DAE).

10.3.1 Factorization Machines

The factorization machine is a relatively new kind of algorithm. It was explicitly designed for sparse datasets. Let’s illustrate the problem.

Figure 55: Example of sparse feature vectors \mathbf{x} with their respective labels y.

In fig. 55 you see an example of sparse feature vectors with labels. Each feature vector represents information about one specific user and one specific movie. Features in the blue section represent a user. Users are encoded as one-hot vectors. Features in the green section represent a movie. Movies are also encoded as one-hot vectors. Features in the yellow section represent the normalized scores the user in blue gave to each movie they rated. Feature x_{99} represents the ratio of Oscar-winning movies among those the user has watched. Feature x_{100} represents the percentage of the movie watched by the user in blue before they scored the movie in green. The target y is the score given by the user in blue to the movie in green.

Real recommender systems can have millions of users, so the matrix in fig. 55 can have hundreds of millions of rows. The number of features could also be in the millions, depending on how rich the choice of content is and how creative you, as a data analyst, are in feature engineering. Features x_{99} and x_{100} were handcrafted during the feature engineering process, and I only show two such features for the purposes of illustration.

Trying to fit a regression or classification model to such an extremely sparse dataset would result in poor generalization. Factorization machines approach this problem differently.

The factorization machine model is defined as follows:

\begin{split} f(\mathbf{x}) \stackrel{\text{def}}{=} b +\sum_{i=1}^{D}w_{i}x_{i} \\ +\sum_{i=1}^{D}\sum_{j=i+1}^{D} (\mathbf{v}_{i}\mathbf{v}_{j}) x_{i}x_{j}, \end{split}

where b and w_i, i=1,\ldots,D, are scalar parameters similar to those used in linear regression. Vectors \mathbf{v}_i are k-dimensional vectors of factors. k is a hyperparameter, usually much smaller than D. The expression \mathbf{v}_{i}\mathbf{v}_{j} is the dot-product of the i^{\textrm{th}} and j^{\textrm{th}} vectors of factors. As you can see, instead of looking for one wide vector of parameters, which would reflect interactions between features poorly because of sparsity, we complete it with additional parameters that apply to pairwise interactions x_{i}x_{j} between features. However, instead of having a parameter w_{i,j} for each interaction, which would add an enormous1 quantity of new parameters to the model, we factorize w_{i,j} into \mathbf{v}_i\mathbf{v}_j by adding only Dk \ll D(D-1) parameters to the model2.
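The FM prediction can be sketched in NumPy. The parameters b, w, and V below are hypothetical placeholders for learned values; the pairwise sum uses the standard algebraic identity that reduces the naive O(kD²) double sum to O(kD).

```python
import numpy as np

# Sketch of the factorization machine prediction f(x).
# b, w, V are placeholders for parameters normally learned by gradient descent.
def fm_predict(x, b, w, V):
    """f(x) = b + sum_i w_i x_i + sum_{i<j} (v_i . v_j) x_i x_j.

    V has shape (D, k): one k-dimensional factor vector per feature.
    The pairwise term uses the identity
    0.5 * sum_f ((V^T x)_f^2 - sum_i V_{i,f}^2 x_i^2).
    """
    linear = b + w @ x
    Vx = V.T @ x                                          # shape (k,)
    pairwise = 0.5 * (np.sum(Vx ** 2) - np.sum((V ** 2).T @ (x ** 2)))
    return linear + pairwise
```

With D = 2 and k = 1, the pairwise term is just (v_1 · v_2) x_1 x_2, which makes the identity easy to check by hand.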

Depending on the problem, the loss function could be squared error loss (for regression), hinge loss, or logistic loss. For classification with y \in \{-1,+1\}, with hinge loss or logistic loss the prediction is made as y = \operatorname{sign}(f(x)). The logistic loss is defined as,

loss(f(\mathbf{x}),y)={\frac{1}{\ln 2}}\ln(1+e^{-yf(\mathbf{x})}).

Gradient descent can be used to optimize the average loss. In the example in fig. 55, the labels are in \{1,2,3,4,5\}, so it’s a multiclass problem. We can use the one-versus-rest strategy to convert this multiclass problem into five binary classification problems.

10.3.2 Denoising Autoencoders

From Chapter 7, you know what a denoising autoencoder is: it’s a neural network that reconstructs its input from the bottleneck layer. The fact that the input is corrupted by noise while the output shouldn’t be makes denoising autoencoders an ideal tool to build a recommender model.

The idea is very straightforward: new movies a user could like are seen as if they were removed from the complete set of preferred movies by some corruption process. The goal of the denoising autoencoder is to reconstruct those removed items.

To prepare the training set for our denoising autoencoder, remove the blue and green features from the training set in fig. 55. Because now some examples become duplicates, keep only the unique ones.

At training time, randomly replace some of the non-zero yellow features in the input feature vectors with zeros. Train the autoencoder to reconstruct the uncorrupted input.
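The corruption step can be sketched as a function that independently zeroes each non-zero rating with some probability. The drop probability and the rating vector below are hypothetical; zeros (unrated items) are left untouched.

```python
import random

# Sketch of the corruption step for a denoising autoencoder input:
# each non-zero rating is dropped (set to 0) with probability drop_prob.
def corrupt(ratings, drop_prob=0.3, rng=None):
    """Return a corrupted copy of the rating vector; zeros stay zeros."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    return [0.0 if r != 0 and rng.random() < drop_prob else r
            for r in ratings]
```

During training, the autoencoder receives `corrupt(ratings)` as input and the original `ratings` as the reconstruction target.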

At prediction time, build a feature vector for the user. The feature vector will include the uncorrupted yellow features as well as the handcrafted features such as x_{99} and x_{100}. Use the trained DAE model to reconstruct the uncorrupted input, and recommend to the user the movies that have the highest scores at the model’s output.

Another effective collaborative-filtering model is an FFNN with two inputs and one output. Remember from Chapter 8 that neural networks are good at handling multiple simultaneous inputs. A training example here is a triplet (\mathbf{u}, \mathbf{m}, r). The input vector \mathbf{u} is a one-hot encoding of a user. The second input vector \mathbf{m} is a one-hot encoding of a movie. The output layer could be either a sigmoid (in which case the label r is in [0,1]) or a ReLU, in which case r can be in some typical range, [1,5] for example.

10.4 Self-Supervised Learning: Word Embeddings

We have already discussed word embeddings in Chapter 7. Recall that word embeddings are feature vectors that represent words. They have the property that similar words have similar feature vectors. The question that you probably wanted to ask is where these word embeddings come from. The answer is (again): they are learned from data.

There are many algorithms to learn word embeddings. Here, we consider only one of them: word2vec, and only one version of word2vec called skip-gram, which works well in practice. Pretrained word2vec embeddings for many languages are available to download online.

In word embedding learning, our goal is to build a model which we can use to convert a one-hot encoding of a word into a word embedding. Let our dictionary contain 10,000 words. The one-hot vector for each word is a 10,000-dimensional vector of all zeroes except for one dimension that contains a 1. Different words have the 1 in different dimensions.

Consider a sentence: “I almost finished reading the book on machine learning.” Now, consider the same sentence from which we have removed one word, say “book.” Our sentence becomes: “I almost finished reading the \cdot on machine learning.” Now let’s only keep the three words before the \cdot and three words after: “finished reading the \cdot on machine learning.” Looking at this seven-word window around the \cdot, if I ask you to guess what \cdot stands for, you would probably say: “book,” “article,” or “paper.” That’s how the context words let you predict the word they surround. It’s also how the machine can learn that words “book,” “paper,” and “article” have a similar meaning: because they share similar contexts in multiple texts.

It turns out that it works the other way around too: a word can predict the context that surrounds it. The piece “finished reading the \cdot on machine learning” is called a skip-gram with window size 7 (3 + 1 + 3). By using the documents available on the Web, we can easily create hundreds of millions of skip-grams.

Let’s denote a skip-gram like this: [\mathbf{x}_{-3}, \mathbf{x}_{-2}, \mathbf{x}_{-1}, \mathbf{x}, \mathbf{x}_{+1}, \mathbf{x}_{+2}, \mathbf{x}_{+3}]. In our sentence, \mathbf{x}_{-3} is the one-hot vector for “finished,” \mathbf{x}_{-2} corresponds to “reading,” \mathbf{x} is the skipped word (\cdot), \mathbf{x}_{+1} is “on,” and so on. A skip-gram with window size 5 will look like this: [\mathbf{x}_{-2}, \mathbf{x}_{-1}, \mathbf{x}, \mathbf{x}_{+1}, \mathbf{x}_{+2}].
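Extracting such skip-grams from tokenized text is straightforward. This sketch works on word tokens rather than one-hot vectors; `half` is the number of context words kept on each side, so the window size is 2·half + 1 (near the edges of a sentence the context is simply shorter).

```python
# Sketch: extracting (context, center) skip-gram pairs from a sentence.
def skip_grams(tokens, half=3):
    """For each position i, pair the center word tokens[i] with up to
    `half` words on each side of it."""
    grams = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]
        grams.append((context, center))
    return grams

tokens = "finished reading the book on machine learning".split()
grams = skip_grams(tokens, half=3)
```

For the sentence above, the pair at position 3 has “book” as the center word and the six surrounding words as its context.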

The skip-gram model with window size 5 is schematically depicted below:

Figure 56: The skip-gram model with window size 5 and an embedding layer of 300 units.

It is a fully connected network, like the multilayer perceptron. The input word is the one denoted as \cdot in the skip-gram. The neural network has to learn to predict the context words of the skip-gram given the central word.

You can see now why the learning of this kind is called self-supervised: the labeled examples get extracted from the unlabeled data such as text.

The activation function used in the output layer is softmax. The cost function is the negative log-likelihood. The embedding for a word is obtained as the output of the embedding layer when the one-hot encoding of this word is given as the input to the model.

Because of the large number of parameters in the word2vec models, two techniques are used to make the computation more efficient: hierarchical softmax (an efficient way of computing softmax that consists in representing the outputs of softmax as leaves of a binary tree) and negative sampling (where the idea is only to update a random sample of all outputs per iteration of gradient descent). I leave these for further reading.


  1. To be more precise, we would add D(D-1) parameters w_{i,j}.

  2. The notation \ll means “much less than.”

11 Conclusion

Wow, that was fast! You are really good if you got here and managed to understand most of the book’s material.

If you look at the number at the bottom of this page, you see that I have overspent paper, which means that the title of the book was slightly misleading. I hope that you forgive me for this little marketing trick. After all, if I wanted to make this book exactly a hundred pages, I could reduce font size, white margins, and line spacing, or remove the section on UMAP and leave you on your own with the original paper. Believe me: you would not want to be left on your own with the original paper on UMAP! (Just kidding.)

However, by stopping now, I feel confident that you have everything you need to become a great modern data analyst or machine learning engineer. That doesn’t mean that I covered everything, but what I covered in a hundred-plus pages you would otherwise find in a bunch of books, each a thousand pages thick. Much of what I covered is not in those books at all: typical machine learning books are conservative and academic, while I emphasized the algorithms and methods that you will find useful in your day-to-day work.

What exactly would I have covered if it was a thousand-page machine learning book?

11.1 What Wasn’t Covered

11.1.1 Topic Modeling

In text analysis, topic modeling is a prevalent unsupervised learning problem. You have a collection of text documents, and you would like to discover the topics present in each document. Latent Dirichlet Allocation (LDA) is a very effective algorithm for topic discovery. You decide how many topics are present in your collection of documents, and the algorithm assigns a topic to each word in this collection. Then, to extract the topics from a document, you simply count how many words of each topic are present in that document.

11.1.2 Gaussian Processes

Gaussian processes (GPs) are a supervised learning method that competes with kernel regression. They have some advantages over the latter. For example, they provide confidence intervals for the regression line at each point. I decided not to explain GPs because I could not figure out a simple way to present them, but you could definitely spend some time learning about them. It will be time well spent.

11.1.3 Generalized Linear Models

The generalized linear model (GLM) is a generalization of linear regression to modeling various forms of dependency between the input feature vector and the target. Logistic regression, for instance, is one form of GLM. If you are interested in regression and you are looking for simple and explainable models, you should definitely read more on GLMs.

11.1.4 Probabilistic Graphical Models

I mentioned one example of probabilistic graphical models (PGMs) in Chapter 7: conditional random fields (CRFs). With a CRF you can model the input sequence of words and the relationships between the features and labels in this sequence as a sequential dependency graph. More generally, a PGM can be any graph. A graph is a structure consisting of a collection of nodes and edges, each edge joining a pair of nodes. Each node in a PGM represents some random variable (whose values can be observed or unobserved), and edges represent the conditional dependence of one random variable on another. For example, the random variable “sidewalk wetness” depends on the random variable “weather condition.” By observing the values of some random variables, an optimization algorithm can learn from data the dependency between the observed and unobserved variables.

PGMs allow the data analyst to see how the value of one feature depends on the values of other features. If the edges of the dependency graph are directed, it becomes possible to infer causality. Unfortunately, constructing such models by hand requires a substantial amount of domain expertise and a strong understanding of probability theory and statistics; the latter is often a problem for many domain experts. Some algorithms can learn the structure of dependency graphs from data, but the learned models are often hard for a human to interpret, and thus they aren’t very useful for understanding the complex probabilistic processes that generated the data. The CRF is by far the most used PGM, with applications mostly in text and image processing. However, in these two domains, CRFs were surpassed by neural networks. Another graphical model, the hidden Markov model (HMM), was frequently used in the past for speech recognition, time-series analysis, and other temporal inference tasks but, again, HMMs lost to neural networks.

If you still decide to learn more about PGMs, they are also known as Bayesian networks, belief networks, and probabilistic independence networks.

11.1.5 Markov Chain Monte Carlo

如果您使用图形模型并希望从依赖图定义的非常复杂的分布中采样示例,则可以使用马尔可夫链蒙特卡罗(MCMC) 算法。 MCMC 是一类从数学定义的概率分布中进行采样的算法。请记住,当我们谈论去噪自动编码器时,我们从正态分布中采样噪声。从标准分布(例如正态分布或均匀分布)中采样相对容易,因为它们的属性是众所周知的。然而,当概率分布可以具有由复杂公式定义的任意形式时,采样任务变得更加复杂。

If you work with graphical models and want to sample examples from a very complex distribution defined by the dependency graph, you could use Markov Chain Monte Carlo (MCMC) algorithms. MCMC is a class of algorithms for sampling from any probability distribution defined mathematically. Remember that when we talked about denoising autoencoders, we sampled noise from the normal distribution. Sampling from standard distributions, such as normal or uniform, is relatively easy because their properties are well known. However, the task of sampling becomes significantly more complicated when the probability distribution can have an arbitrary form defined by a complex formula.
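As a minimal sketch of the idea (not from the book), the Metropolis algorithm, one of the simplest MCMC methods, needs the target density only up to a normalizing constant; the target below is an arbitrary non-standard density chosen for illustration:

```python
import math
import random

random.seed(0)

def unnormalized_p(x):
    # A target density known only up to its normalizing constant;
    # there is no off-the-shelf sampler for this shape.
    return math.exp(-x**4)

def metropolis(n_samples, step=1.0):
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + random.gauss(0.0, step)  # random-walk proposal
        # Accept with probability min(1, p(proposal) / p(x)); the unknown
        # normalizing constant cancels in the ratio.
        if random.random() < unnormalized_p(proposal) / unnormalized_p(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(20000)
mean = sum(samples) / len(samples)  # the target is symmetric, so mean ≈ 0
```

Successive samples are correlated (each is a small step from the previous one), which is the price paid for being able to sample from an arbitrary density.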

11.1.6生成对抗网络

11.1.6 Generative Adversarial Networks

生成对抗网络(GAN)是一类用于无监督学习的神经网络。它们被实现为两个神经网络系统,在零和游戏设置中相互竞争。 GAN 最流行的应用是学习生成对于人类观察者来说看起来真实的照片。两个网络中的第一个采用随机输入(通常是高斯噪声)并学习生成像素矩阵形式的图像。第二个网络将两个图像作为输入:来自某些图像集合的一个“真实”图像以及第一个网络生成的图像。第二个网络必须学会识别两个图像中的哪一个是由第一个网络生成的。如果第二个网络识别出“假”图像,第一个网络会得到负损失。另一方面,如果第二个网络无法识别两张图像中哪一张是假的,它就会受到惩罚。

Generative adversarial networks, or GANs, are a class of neural networks used in unsupervised learning. They are implemented as a system of two neural networks contesting with each other in a zero-sum game setting. The most popular application of GANs is to learn to generate photographs that look authentic to human observers. The first of the two networks takes a random input (typically Gaussian noise) and learns to generate an image as a matrix of pixels. The second network takes as input two images: one “real” image from some collection of images as well as the image generated by the first network. The second network has to learn to recognize which one of the two images was generated by the first network. The first network gets a negative loss if the second network recognizes the “fake” image. The second network, on the other hand, gets penalized if it fails to recognize which one of the two images is fake.
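As a minimal numeric sketch of this zero-sum setup (with made-up discriminator outputs rather than trained networks), the two competing losses for one training step might be computed as:

```python
import math

# Made-up discriminator outputs for one step: D's estimated probability
# that each input image is real.
d_real = 0.8  # D(real image): the discriminator correctly leans "real"
d_fake = 0.3  # D(generated image): the discriminator leans "fake"

# The discriminator is penalized unless d_real -> 1 and d_fake -> 0.
loss_discriminator = -(math.log(d_real) + math.log(1.0 - d_fake))

# The generator is penalized unless the discriminator is fooled (d_fake -> 1).
loss_generator = -math.log(d_fake)
```

In an actual GAN, gradients of these two losses are used to update the two networks in alternation, so improving one network makes the other's task harder.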

11.1.7遗传算法

11.1.7 Genetic Algorithms

遗传算法(GA)是一种数值优化技术,用于优化不可微的目标函数。它们借用进化生物学的概念,通过模仿生物进化过程来搜索优化问题的全局最优值(最小值或最大值)。

Genetic algorithms (GAs) are a numerical optimization technique used to optimize non-differentiable objective functions. They use concepts from evolutionary biology to search for the global optimum (minimum or maximum) of an optimization problem by mimicking the processes of biological evolution.

GA 的工作方式是从第一代候选解决方案开始。如果我们寻找模型参数的最佳值,我们首先随机生成参数值的多种组合。然后,我们根据目标函数测试参数值的每个组合。将参数值的每个组合想象为多维空间中的一个点。然后,我们通过应用“选择”、“交叉”和“变异”等概念,从上一代点生成下一代点。

GAs work by starting with an initial generation of candidate solutions. If we look for optimal values of our model's parameters, we first randomly generate multiple combinations of parameter values. We then test each combination against the objective function. Imagine each combination of parameter values as a point in a multi-dimensional space. We then generate a subsequent generation of points from the previous one by applying such concepts as "selection," "crossover," and "mutation."

简而言之,这会使每一代新的点集中保留更多与上一代中在目标函数上表现最佳的点相似的点。在新一代中,上一代中表现最差的点会被表现最好的点的“变异”和“交叉”所取代。一个点的变异是通过对原始点的某些属性进行随机扭曲而得到的。交叉则是若干个点的某种组合(例如取平均值)。

In a nutshell, each new generation keeps more points similar to those points from the previous generation that performed best against the objective. In the new generation, the points that performed worst in the previous generation are replaced by "mutations" and "crossovers" of the points that performed best. A mutation of a point is obtained by randomly distorting some attributes of the original point; a crossover is some combination of several points (for example, an average).
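The selection/crossover/mutation loop described above can be sketched in a few lines of Python; the toy objective, population sizes, and mutation scale below are invented for illustration:

```python
import random

random.seed(0)

def objective(point):
    # Toy objective to minimize; the global optimum is at (0, 0).
    return sum(x * x for x in point)

def mutate(point, scale=0.3):
    # Mutation: a random distortion of the point's attributes.
    return [x + random.gauss(0.0, scale) for x in point]

def crossover(a, b):
    # Crossover: a combination of several points -- here, an average.
    return [(xa + xb) / 2 for xa, xb in zip(a, b)]

# Initial generation: random points in [-5, 5]^2.
population = [[random.uniform(-5, 5) for _ in range(2)] for _ in range(30)]

for generation in range(50):
    population.sort(key=objective)  # selection: rank against the objective
    parents = population[:10]       # the best performers survive
    children = [crossover(random.choice(parents), random.choice(parents))
                for _ in range(10)]
    mutants = [mutate(random.choice(parents)) for _ in range(10)]
    population = parents + children + mutants  # the worst are replaced

best = min(population, key=objective)
```

Note that nothing here requires a gradient of the objective, which is why GAs apply to non-differentiable criteria.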

遗传算法允许找到任何可测量的优化标准的解决方案。例如,遗传算法可用于优化学习算法的超参数。它们通常比基于梯度的优化技术慢得多。

Genetic algorithms make it possible to find solutions to any measurable optimization criterion. For example, GAs can be used to optimize the hyperparameters of a learning algorithm. They are, however, typically much slower than gradient-based optimization techniques.

11.1.8强化学习

11.1.8 Reinforcement Learning

正如我们已经讨论过的,强化学习(RL) 解决了一种非常具体的问题,其中决策是顺序的。通常,有一个代理在未知的环境中行动。每个动作都会带来奖励,并将代理移动到环境的另一种状态(通常是某些具有未知属性的随机过程的结果)。代理的目标是优化其长期奖励。

As we already discussed, reinforcement learning (RL) solves a very specific kind of problem where the decision making is sequential. Usually, there’s an agent acting in an unknown environment. Each action brings a reward and moves the agent to another state of the environment (usually, as a result of some random process with unknown properties). The goal of the agent is to optimize its long-term reward.

强化学习算法(例如 Q-learning)及其基于神经网络的变体,被用于学习玩视频游戏、机器人导航与协调、库存和供应链管理、复杂电力系统(电网)的优化,以及金融交易策略的学习。

Reinforcement learning algorithms, such as Q-learning, and their neural network-based counterparts are used in learning to play video games, in robotic navigation and coordination, in inventory and supply chain management, in the optimization of complex electric power systems (power grids), and in learning financial trading strategies.
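As a minimal sketch (not from the book), here is tabular Q-learning on a toy five-state corridor, where the agent is rewarded for reaching the rightmost state; the environment and all hyperparameters are invented for illustration:

```python
import random

random.seed(0)

# Toy environment: a corridor of states 0..4; the agent starts in state 2,
# moves left (-1) or right (+1), and receives reward 1 on reaching the
# terminal state 4. Walls clamp moves at both ends.
n_states = 5
actions = [+1, -1]
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(500):
    s = 2
    while s != 4:
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: move Q[(s, a)] toward the immediate reward
        # plus the discounted best value of the next state.
        future = 0.0 if s_next == 4 else gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + future - Q[(s, a)])
        s = s_next

# After training, the learned Q-values prefer moving right along the corridor,
# which is the policy that maximizes long-term (discounted) reward.
```

The same update rule drives deep Q-learning, except that the table Q is replaced by a neural network approximating it.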


书到这里就停了。不要忘记偶尔访问本书的配套 wiki,以随时了解本书中考虑的每个机器学习领域的新发展。正如我在前言中所说,这本书得益于不断更新的wiki,就像好酒在你买了之后变得越来越好一样。

The book stops here. Don't forget to occasionally visit the book's companion wiki to stay updated on new developments in each machine learning area considered in the book. As I said in the Preface, thanks to its constantly updated wiki this book, like good wine, keeps getting better after you buy it.

哦,别忘了这本书是按照先读后买的原则发行的。这意味着,如果您在阅读这些文字时看到数字屏幕上的文本,并且不记得是否已付费购买该书,那么您可能是购买这本书的合适人选。

Oh, and don't forget that the book is distributed on the read first, buy later principle. That means that if, while reading these words, you are looking at them on a digital screen and cannot remember having paid for the book, you are probably the right person to buy it.

11.2致谢

11.2 Acknowledgements

如果没有志愿编辑,这本书就不可能有如此高的质量。我特别感谢以下读者的系统贡献:Martijn van Attekum、Daniel Maraini、Ali Aziz、Rachel Mak、Kelvin Sundli 和 John Robinson。

The high quality of this book would be impossible without volunteering editors. I especially thank the following readers for their systematic contributions: Martijn van Attekum, Daniel Maraini, Ali Aziz, Rachel Mak, Kelvin Sundli, and John Robinson.

我还要感谢以下为本书提供帮助的优秀人士:Michael Anuzis、Knut Sverdrup、Freddy Drennan、Carl W. Handlin、Abhijit Kumar、Lasse Vetter、Ricardo Reis、Daniel Gross、Johann Faouzi、Akash Agrawal、Nathanael Weill、Filip Jekic、Abhishek Babuji、Luan Vieira、Sayak Paul、Vaheid Wallets、Lorenzo Buffoni、Eli Friedman、Łukasz Mądry、秦浩兰、Bibek Behera、Jennifer Cooper、Nishant Tyagi、Denis Akhiyarov、Aron Janarv、Alexander Ovcharenko、Ricardo Rios、Michael Mullen、Matthew Edwards、David Etlin、Manoj Balaji J、David Roy、Luan Vieira、Luiz Felix、Anand Mohan、Hadi Sotudeh、Charlie Newey、Zamir Akimbekov、Jesus Renero、Karan Gadiya、Mustafa Anıl Derbent、JQ Veenstra、Zsolt Kreisz、Ian Kelly、Lukasz Zawada、Magda Kowalska、Sylvain Pronovost、Robert Wareham、Thomas Bosman、Lv Steven、Ariel Rossanigo 和 Luciano Segura。

Other wonderful people to whom I am grateful for their help are Michael Anuzis, Knut Sverdrup, Freddy Drennan, Carl W. Handlin, Abhijit Kumar, Lasse Vetter, Ricardo Reis, Daniel Gross, Johann Faouzi, Akash Agrawal, Nathanael Weill, Filip Jekic, Abhishek Babuji, Luan Vieira, Sayak Paul, Vaheid Wallets, Lorenzo Buffoni, Eli Friedman, Łukasz Mądry, Haolan Qin, Bibek Behera, Jennifer Cooper, Nishant Tyagi, Denis Akhiyarov, Aron Janarv, Alexander Ovcharenko, Ricardo Rios, Michael Mullen, Matthew Edwards, David Etlin, Manoj Balaji J, David Roy, Luan Vieira, Luiz Felix, Anand Mohan, Hadi Sotudeh, Charlie Newey, Zamir Akimbekov, Jesus Renero, Karan Gadiya, Mustafa Anıl Derbent, JQ Veenstra, Zsolt Kreisz, Ian Kelly, Lukasz Zawada, Magda Kowalska, Sylvain Pronovost, Robert Wareham, Thomas Bosman, Lv Steven, Ariel Rossanigo and Luciano Segura.